Thursday, February 11, 2016

Fonts in PDF documents

This post describes how fonts are included (a.k.a. embedded) in PDF documents.

If you want to see which fonts are embedded in a PDF document, open it in Adobe Reader and press Ctrl+D. The 'Fonts' tab shows information like this:
Show embedded fonts in the PDF

So what does a term like "Embedded subset" actually mean?


Embedded subset fonts


If a PDF document has an embedded subset font, it means that the document contains all the information that is needed to draw the used characters of that font. In this example the PDF contains the text "This is a text", and therefore it contains the glyphs (a.k.a. outlines) of these characters.

A glyph defines what a character looks like. Each glyph is basically a set of drawing instructions. A program like PDFRasterizer uses glyphs to draw the characters of a PDF document into a bitmap.

There is only one glyph for the character 'i', although it is used multiple times in the text. Even if the 'i' is used at small, large and very large font sizes, the same glyph is reused.
Different glyphs for the character 'a'

Many characters like "bcdfghjklm..." are not used in the PDF document, so there is no need to embed their glyphs. The PDF document only embeds the glyphs of the characters that are actually used. This keeps the PDF document small even if the fonts that are used contain a huge number of glyphs (think of fonts that contain Korean, Japanese or Chinese glyphs).
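
To make the idea of a subset concrete, here is a minimal sketch that collects the distinct characters of the example text. It is a simplification: the class and method names are made up for this illustration, and a real PDF writer works with glyph IDs from the font rather than with characters.

```csharp
using System;
using System.Linq;

public static class SubsetSketch
{
    // Returns the distinct non-whitespace characters of a text, sorted by
    // character code. A subsetting writer would embed roughly one glyph
    // per entry (hypothetical helper, not part of any TallComponents API).
    public static char[] UsedCharacters(string text)
    {
        return text.Where(c => !char.IsWhiteSpace(c))
                   .Distinct()
                   .OrderBy(c => c)
                   .ToArray();
    }

    public static void Main()
    {
        char[] used = UsedCharacters("This is a text");
        // 14 characters of text, but only 8 distinct glyphs are needed
        Console.WriteLine("{0} glyphs: {1}", used.Length, new string(used));
    }
}
```

For "This is a text" this reports 8 characters (T, a, e, h, i, s, t, x), no matter how often each one occurs.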

Note on forms in which a user can enter text:
In a PDF with forms there is no other choice than to embed all the glyphs of the font. There is no way to predict the used characters, as the user may enter any possible character. If you create such PDF documents, and if you are concerned about the size of the file, it may be best to avoid embedded fonts.

Standard fonts

Standard fonts are never embedded in a PDF

There are 14 fonts that are never embedded. These are Times, Courier, Helvetica, Symbol and Zapf Dingbats; the first three come in variations like bold or italic, so these 5 fonts and their variations add up to 14. The PDF specification mandates that these must always be available and therefore there is no need to embed them.

These fonts are usually present on the operating system and are used when a PDF file is created or when it is read. These fonts contain not only the glyphs but also additional information like the width of each character.


Font substitution

There may be situations in which the fonts that are used don't give the desired results, whether they are embedded or not. This is what the font substitution map is designed for. It simply maps any font that is used in the PDF to a different one.

The following code sample translates the fonts named "Demo Font" and "Demo Font Bold" to variants of Arial.

RenderSettings settings = new RenderSettings();
settings.TextSettings.FontSubstitutionMap.Add("Demo Font", "ariali.ttf");
settings.TextSettings.FontSubstitutionMap.Add("Demo Font Bold", "arialbi.ttf");

This has the effect that all characters that use the font 'Demo Font' will be rendered using the glyphs in the font 'ariali.ttf'.

Note on character widths:
In a font each character has its own width. Usually an 'i' is narrow and an 'm' is wide, but this is not always the case. The most notable exceptions are monospaced fonts like Courier, in which the 'i' and the 'm' have exactly the same width. How much the widths differ depends on the font that you use for substitution: in Arial they differ significantly, but in Courier they are exactly the same.
Font substitution may go wrong

Because of this the spacing between the characters in text may look ugly. In the picture above the spacing between the character 'm' and 'o' is very different compared to the original on the left. So if you substitute a font then pick one that resembles the character widths of the original as closely as possible.
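
The effect of different widths can be illustrated with a small calculation. The width tables below are invented for this sketch (they are not taken from Arial, Courier or any other real font) and are expressed in thousandths of the font size, which is the usual unit for glyph widths in PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WidthSketch
{
    // Hypothetical advance widths in 1/1000 of the font size.
    public static readonly Dictionary<char, int> Proportional =
        new Dictionary<char, int> { { 'i', 250 }, { 'm', 800 }, { 'o', 550 } };

    public static readonly Dictionary<char, int> Monospaced =
        new Dictionary<char, int> { { 'i', 600 }, { 'm', 600 }, { 'o', 600 } };

    // Sums the advance widths of all characters and scales by the font size.
    public static double TextWidth(string text, IDictionary<char, int> widths, double fontSize)
    {
        return text.Sum(c => widths[c]) * fontSize / 1000.0;
    }

    public static void Main()
    {
        // The same word at 10 pt takes up a different width in each font.
        Console.WriteLine(TextWidth("mom", Proportional, 10)); // 21.5
        Console.WriteLine(TextWidth("mom", Monospaced, 10));   // 18
    }
}
```

If the PDF positions each character using the widths of the original font while the glyphs come from the substitute, a mismatch like this shows up as gaps or overlaps between characters.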

Font resolving

It is also possible, even common, that a PDF document uses a font that is neither a standard font nor an embedded font. In that case, it is up to the application to decide how to render the text. In general this is done by selecting an available font that closely matches the font parameters (most importantly the name, but there are more aspects). The FontSearchPath is also used for this substitution; it allows an application to control which fonts can be used.

font search path

Monday, January 11, 2016

Making Adobe Reader default for PDF in Windows 10

This morning started with a rather boring and common activity, scanning a document and modifying the resulting PDF.

To my surprise the PDF opened in Internet Explorer. Apparently, after updating to Windows 10, many default programs had been changed. The scanned PDF needed to be rotated, but I could not find any tool to do this in Internet Explorer, and I was not very motivated to look for one either. In Adobe Reader it takes around 3 seconds to rotate the PDF. Therefore this very short post explains how you can set the default back to Adobe Reader.

Before opening the document you can right-click it, choose "Open with" and select "Adobe Reader". This does not change the default setting; it only works for one document at a time. You will get a message asking whether you would like to set Adobe Reader as the default. Unfortunately this does not work in Windows 10.

Open your PDF in Windows 10 with Adobe Reader as default:

  1. Open the start menu
  2. Type "default app settings" and select the result
  3. Click on "Choose default apps by file type"
  4. Go down to ".pdf" (The list is in alphabetical order) and click on the Internet Explorer symbol.
  5. Select "Adobe Reader"
That's it.

Thursday, January 7, 2016

Submitting, processing and responding to PDF form data

If you have to choose between an HTML form and a PDF form - or maybe you are required to support both - then it is good to know about the differences between these two forms and what they have in common. The scope of this article is restricted to classic PDF forms - as opposed to XFA forms.

HTML form

An HTML form looks like this:

<h1>I want pizza!</h1>
<form method="get" action="order">
   <p>Choose a size:</p>
   <input type="radio" name="size" value="small">small<br>
   <input type="radio" name="size" value="medium">medium<br>
   <input type="radio" name="size" value="large">large
   <p>Choose ingredients:</p>
   <input type="checkbox" name="tomatoes" value="tomatoes">tomatoes<br>
   <input type="checkbox" name="onions" value="onions">onions<br>
   <input type="checkbox" name="tuna" value="tuna">tuna<br>
   <input type="checkbox" name="cheese" value="cheese">cheese
   <p>My name:</p>
   <input type="text" name="name" /><input type="submit" value="order" />
</form>

Both the field itself and how it looks are represented by the same element, namely the input element. PDF, on the other hand, completely separates the notion of a field from its representation.

PDF form

PDF fields are defined at document level and each field may have zero or more visual representations called widgets. Each widget is associated with a page. The following diagram shows this structure and how document, page, field and widget are related:
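
The relationship can also be sketched as a small object model. The class names below are hypothetical and chosen for this illustration only; they are not the PDFKit.NET types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A widget is the visual representation of a field on one page.
public class SketchWidget
{
    public int PageIndex;
    public SketchWidget(int pageIndex) { PageIndex = pageIndex; }
}

// A field lives at document level and owns zero or more widgets.
public class SketchField
{
    public string Name;
    public List<SketchWidget> Widgets = new List<SketchWidget>();
    public SketchField(string name) { Name = name; }
}

public class SketchDocument
{
    public List<SketchField> Fields = new List<SketchField>();

    // A field is visible on a page if at least one of its widgets is on it.
    public IEnumerable<string> FieldNamesOnPage(int pageIndex)
    {
        return Fields.Where(f => f.Widgets.Any(w => w.PageIndex == pageIndex))
                     .Select(f => f.Name);
    }
}

public static class FormStructureSketch
{
    public static SketchDocument Build()
    {
        var document = new SketchDocument();

        // one radio button group, three widgets on the first page
        var size = new SketchField("size");
        for (int i = 0; i < 3; i++) size.Widgets.Add(new SketchWidget(0));
        document.Fields.Add(size);

        // a field without any widget: it holds data but is never shown
        document.Fields.Add(new SketchField("hidden"));
        return document;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Build().FieldNamesOnPage(0))); // size
    }
}
```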

Throughout this article I will use the following PDF form:


I used Notepad to create the text part and then printed it to PDF. Next, I used Adobe Acrobat Pro DC to add the form elements. The size options are the radio buttons "small", "medium" and "large", which are part of the same group named "size". The ingredients are checkboxes with corresponding names. Finally there is a textbox named "name" and a button named "order".

Add a form submit button to the PDF

In order to submit a PDF form to a web endpoint, you need to add a button with a submit form action. You typically do this in Adobe Acrobat. Here is what the actions tab of the button properties dialog looks like after adding a submit form action:


If you select the action and click the Edit button, you will see the available options for submitting form data:


The selected export format is HTML. This will POST all form data to the specified URL when the button is clicked. Note that in contrast to an HTML form it is not possible to specify GET as the HTTP method. Later on we will see how to handle this request in an ASP.NET MVC application.

Open the PDF form

Let's open this form in the browser and see what happens. You can open it from here: http://www.tallcomponents.com/demos/pizza/form.

As an implementation note, the form is located inside the Content folder of an MVC app and the action method looks like this:

public class PizzaController : Controller
{
   public ActionResult Form()
   {
      return File("~/Content/order-pizza.pdf", "application/pdf");
   }
}

There is a good chance that your browser will render the PDF form itself instead of using the Adobe Reader plug-in. Google Chrome renders the PDF as HTML and consequently breaks a great deal of PDF features, including submitting form data. Edge does the same thing. In fact, all modern web browsers have stopped supporting the NPAPI plug-in infrastructure on which the Adobe Reader plug-in relies. If you click the order button in the browser, nothing happens.

This is why Adobe made it possible to submit form data using the latest versions of Adobe Reader. Earlier versions of Adobe Reader did not allow this unless your document was Reader extended. (If you know exactly when this change entered Adobe Reader, then please leave a comment. I tried to Google it but without success.)

To get the full PDF experience when opening PDF documents or forms from the web, you must disable your browser's PDF viewer. Here are the steps for Google Chrome:
  1. Browse to chrome://plugins
  2. Click the disable link of the Chrome PDF Viewer
(Google for similar instructions for other browsers.)

If you now open the form in your browser using the same link, your default system PDF viewer (make sure it is Adobe Reader) opens the PDF outside the browser like this:


Submit form data from Adobe Reader

Clicking the order button from Adobe Reader submits the form data to endpoint http://www.tallcomponents.com/demos/pizza/order. Here is the ASP.NET MVC controller action that handles this request:

public class PizzaController : Controller
{
   [HttpPost]
   public ActionResult Order(Pizza pizza)
   {
      return View(pizza);
   }
}

Model Pizza:

public class Pizza
{
   public string Size { get; set; }
   public string Tomatoes { get; set; }
   public string Onions { get; set; }
   public string Tuna { get; set; }
   public string Cheese { get; set; }
   public string Name { get; set; }
}

View Order.cshtml:

@model Pizza
<h2>Hi @Model.Name!</h2>
<p>
   Thanks for ordering a @Model.Size pizza.
   Tomatoes: @Model.Tomatoes.
   Onions: @Model.Onions.
   Tuna: @Model.Tuna.
   Cheese: @Model.Cheese.
</p>

Note how MVC takes care of mapping form data to members of Pizza based on their names.
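
To illustrate what mapping by name means, here is a deliberately simplified stand-in for the binding step. It is not the actual MVC model binder: the Bind helper, the reduced Pizza class and the sample request body are assumptions made for this sketch.

```csharp
using System;
using System.Reflection;

// Reduced version of the model, just for this sketch.
public class Pizza
{
    public string Size { get; set; }
    public string Name { get; set; }
}

public static class BindingSketch
{
    // Copies each name=value pair of a URL-encoded body to the public string
    // property with the same name, ignoring case. Unknown names are skipped.
    public static T Bind<T>(string body) where T : class, new()
    {
        T model = new T();
        foreach (string pair in body.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] parts = pair.Split(new[] { '=' }, 2);
            string name = Uri.UnescapeDataString(parts[0].Replace('+', ' '));
            string value = parts.Length > 1
                ? Uri.UnescapeDataString(parts[1].Replace('+', ' '))
                : string.Empty;

            PropertyInfo property = typeof(T).GetProperty(
                name, BindingFlags.IgnoreCase | BindingFlags.Public | BindingFlags.Instance);
            if (property != null && property.PropertyType == typeof(string))
            {
                property.SetValue(model, value);
            }
        }
        return model;
    }

    public static void Main()
    {
        // hypothetical body as a form submit might send it
        Pizza pizza = Bind<Pizza>("size=large&name=John+Doe&tuna=tuna");
        Console.WriteLine(pizza.Size + " pizza for " + pizza.Name); // large pizza for John Doe
    }
}
```

The field names in the PDF ("size", "name") are lowercase while the properties are not, which is why the lookup ignores case.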

After clicking the order button, the following dialog displays:


After clicking Allow, Adobe Reader asks permission to open the response:


Apparently, Adobe Reader saves the response to a temporary location. After clicking Yes, the default browser displays the response:


This is as expected but far from a great user experience.

Return a PDF Response 

The previous use case returned HTML as a response. Consequently, a browser instance opens and displays the HTML. Let's see what happens if we return the response as PDF.

I have created a second version of the order pizza form that you can open from here: http://www.tallcomponents.com/demos/pizza/form2.

The order button of this PDF submits the data to a second endpoint that returns a PDF response using PDFKit.NET as follows:

[HttpPost]
public ActionResult Order2(Pizza pizza)
{
   Document document = new Document();
   Page page = new Page(PageSize.Letter);
   document.Pages.Add(page);

   double margin = 72; // points
   MultilineTextShape text = new MultilineTextShape(
      margin, page.Height - margin, page.Width - 2 * margin);
   page.Overlay.Add(text);
   Fragment fragment = new Fragment(
      string.Format("Hi {0}!, thanks for ordering a {1} pizza!", 
         pizza.Name, pizza.Size),
      Font.Helvetica,
      16);
   text.Fragments.Add(fragment);

   Response.ContentType = "application/pdf";
   Response.AppendHeader("Content-disposition", "attachment; filename=file.pdf");
   document.Write(Response.OutputStream);

   return null;
}

If I now click the order button, a new instance opens showing the following response:


Return flattened PDF form

A form is said to be flattened if all fields have been replaced by non-editable graphics corresponding to the form data. Note that the fields have not just been disabled or made read-only but they have been removed entirely and replaced with non-interactive content.

Let's see how we can return the flattened form as a response.

I have created a third version of the order pizza form that you can open from here: http://www.tallcomponents.com/demos/pizza/form3.

The order button of this PDF submits the data to a third endpoint that uses PDFKit.NET to merge the submitted data with the original form and flattens the form as follows:

[HttpPost]
public ActionResult Order3(Pizza pizza)
{
  using (FileStream file = new FileStream(
    Server.MapPath("~/Content/order-pizza3.pdf"),
    FileMode.Open, FileAccess.Read))
  {
    // import submitted data into original form
    Document document = new Document(file);
    FormData data = FormData.Create(System.Web.HttpContext.Current.Request);
    document.Import(data);

    // flatten form
    foreach (Field field in document.Fields)
    {
      foreach (Widget widget in field.Widgets)
      {
        widget.Persistency = WidgetPersistency.Flatten;
      }
    }

    Response.ContentType = "application/pdf";
    Response.AppendHeader("Content-disposition", "inline; filename=file.pdf");
    document.Write(Response.OutputStream);

    return null;
  }
}

If I now click the order button, a new instance opens showing the following response:


The fields still look like fields, but they are actually graphics.

Download

Download the ASP.NET MVC project including PDF forms.

Tuesday, December 1, 2015

Extract text that is stored in a PDF document

Extract text from a PDF document

The purpose of PDF is to provide information that is readable by humans. It goes to great lengths to provide documents with very clear typography and graphics. 

Its purpose is to format information so that it can be printed or shown on a screen, so that a human can read it or interact with it. Its purpose is not to format information in a way that can be read by a computer. So if you have a PDF document and you want to extract the text from it, it may become a bit complicated.
One might expect that the text is always present somewhere in a PDF document, so that the only problem is knowing where to look. However, this is not the case.

How characters appear to be stored in PDF

Just a bunch of fragments

How text is actually stored in PDF

The text can best be seen as a bunch of small fragments that are scattered across a page. Do not expect any ordering that makes sense semantically. Furthermore, each fragment is in fact one or more glyph IDs together with a location on the page (a glyph ID is a number that identifies how the glyph must be drawn). These numbers do not necessarily have any relation to the Unicode values that you want, as they only describe what the text looks like to a human.

Extracting and sorting 


In order to extract the text so it can be processed by a computer, there are two steps to take:

  1. First, the glyph IDs must be converted to character IDs. Some TALLcomponents products will do that for you.
  2. Second, the characters must be sorted so that the text is extracted in the right order. A good start is to sort first from top to bottom and then from left to right. Our code samples contain some examples of that.
This works for text that starts at the top left and must be read towards the bottom right. For other reading orders the sorting algorithm must be changed (which is not that difficult).
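
The sorting step can be sketched as follows. The Fragment class is hypothetical and the coordinates are made up; positions are in PDF user space, where y increases upwards, so "top to bottom" means descending y. A real implementation would also need a tolerance for fragments whose y values differ only slightly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A piece of text with the position of its start point on the page.
public class Fragment
{
    public string Text;
    public double X;
    public double Y;

    public Fragment(string text, double x, double y)
    {
        Text = text; X = x; Y = y;
    }
}

public static class ReadingOrderSketch
{
    // Sorts top to bottom (descending y), then left to right (ascending x).
    public static string Extract(IEnumerable<Fragment> fragments)
    {
        return string.Join(" ", fragments
            .OrderByDescending(f => f.Y)
            .ThenBy(f => f.X)
            .Select(f => f.Text));
    }

    public static void Main()
    {
        // fragments in the arbitrary order in which they occur in the file
        var page = new[]
        {
            new Fragment("World", 150, 700),
            new Fragment("line", 160, 680),
            new Fragment("Hello", 100, 700),
            new Fragment("Second", 100, 680)
        };
        Console.WriteLine(Extract(page)); // Hello World Second line
    }
}
```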

Superscript and subscript

But there are more problems, and one of them is superscript and subscript:
Superscript characters in PDF

If these are just sorted vertically and horizontally, then the result would be "Hello note1 World", while it should have been "Hello World note1". 

There is a code sample written for PDFControls.NET that solves this. You can find it in: KB000315. The code examines the amount of vertical overlap, measured as a fraction of the height of the text on the baseline, and uses it to decide whether a fragment belongs to that line.
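
The overlap test could look roughly like the sketch below. This is not the code from KB000315; the method names and the threshold of 0.25 are assumptions chosen for illustration.

```csharp
using System;

public static class OverlapSketch
{
    // Returns the vertical overlap between a fragment and a text line as a
    // fraction of the line height (0 = no overlap, 1 = fully inside the line).
    public static double OverlapFraction(
        double lineBottom, double lineTop, double bottom, double top)
    {
        double overlap = Math.Min(lineTop, top) - Math.Max(lineBottom, bottom);
        return Math.Max(0, overlap) / (lineTop - lineBottom);
    }

    // Assumed rule: enough overlap means the fragment is part of the line.
    public static bool OnSameLine(
        double lineBottom, double lineTop, double bottom, double top,
        double threshold = 0.25)
    {
        return OverlapFraction(lineBottom, lineTop, bottom, top) >= threshold;
    }

    public static void Main()
    {
        // baseline text occupies y = 700..712; a superscript sits at 706..714
        Console.WriteLine(OnSameLine(700, 712, 706, 714)); // True
        // the next line of text occupies y = 680..692
        Console.WriteLine(OnSameLine(700, 712, 680, 692)); // False
    }
}
```

Once a superscript has been assigned to a line this way, it can be sorted from left to right together with the other fragments of that line.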

Flat low characters 

Flat low characters like the underscore

The same problem exists for flat low characters like the underscore. When sorted, these characters may end up on the next line, while they are actually positioned in between the words of the current line.

This is also handled in the code sample, in which the flat low characters are recognized and the height is modified so that they are sorted correctly.

Multiple columns

PDF Text with multiple columns

Extracting text gets really complicated when the document layout becomes more complex, e.g. when the text is wrapped over more than one column. A PDF document does not have any information on columns, so it's hard to recognize them.

This is not addressed in the code sample. For further reading, there is an article here on this topic: Searching Text and Recognizing Columns. It also contains a code sample that detects these columns.

More

There are many other problems that can make the extraction of text difficult, like mathematical formulas, rotated text, creative layouts and so on. These are not addressed in this post.
For now it is sufficient to say that extracting text was never a design goal of the PDF specification, and therefore it can be complex to work around that.




Tuesday, November 10, 2015

Monadic templating in C# - Part 2

In the previous article (Part 1), I introduced a prototype of a monadic framework for generating formatted output (e.g. source code) from a complex, hierarchical set of data (e.g. the Abstract Syntax Tree of some input language). This second part picks up where the first part left off: it explains the actual templating, which uses the string interpolation feature of C# 6.0.

Introduction


Generating formatted output with lightweight templates has never been easier than with the string interpolation feature of C# 6.0. This feature renders some (formerly very useful) templating libraries, e.g. SmartFormat, obsolete. String interpolation basically means that one can insert arbitrary expressions into a string by wrapping them in curly brackets. We can finally forget String.Format and the numbered placeholders...

In this article, first, string interpolation is introduced briefly by contrasting it with some earlier approaches. Following that I'll show how to marry string interpolation with the previously introduced monadic generators. 

Lightweight string templating


In the past we had two options for lightweight templates. The most obvious choice was the ubiquitous String.Format. It gets a template in its first argument, and uses varargs for passing the objects referenced by the template (the exact same idea used in C since the seventies). The template language is simplistic, references are indexes of the arguments wrapped into curly brackets (if one wants to output curly bracket characters, those must be duplicated for escaping): 

String.Format("function {0}({1}){{{2}}}", name, arguments, body)

It is a very impractical approach, as it is very hard to read. This is mainly because of the indirection in the template and because the indirection is achieved by numbered indexes. Numbered indexes are great for computers, but highly demotivating for humans.

A better approach, used by e.g. SmartFormat, is named placeholders:

Smart.Format("function {name}({arguments}){{{body}}}", function)

Unfortunately, this feature is limited. It works only if the template has exactly one object argument (the names in the template correspond to the members of the object then). It helps a bit that anonymous types can be involved:

Smart.Format("function {name}({arguments}){{{body}}}", new {name, arguments, body})

That is not bad at all, if only those indirect references weren't there...

This is when string interpolation comes into the picture:

String res = $"function {name}({arguments}){{{body}}}";

It is just great. It directly provides strings, and arbitrary expressions can be embedded in the placeholders. It is exactly what we need.

Customizing string interpolation


Previously I was not completely honest: string interpolation does not have to provide a string directly. It can instead provide an object of type FormattableString, which can be turned into a string through its ToString method using a custom IFormatProvider.
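
A minimal sketch of this behavior, using only the standard library: when an interpolated string is assigned to a FormattableString, the composite format and the arguments stay available until ToString is called. The class and method names are made up for this example.

```csharp
using System;
using System.Globalization;

public static class InterpolationSketch
{
    // Because the return type is FormattableString, the compiler keeps the
    // template and its arguments separate instead of building a string.
    public static FormattableString Build(string name, double x)
    {
        return $"{name}({x})";
    }

    public static void Main()
    {
        FormattableString call = Build("f", 1.5);

        Console.WriteLine(call.Format);        // {0}({1})
        Console.WriteLine(call.ArgumentCount); // 2

        // The format provider is only supplied at this point.
        Console.WriteLine(call.ToString(CultureInfo.InvariantCulture)); // f(1.5)
    }
}
```

This late binding of the format provider is what makes it possible to pass in a provider that carries state.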

When we want to execute our generators (a kind of state monad), we need to provide an initial state.
Thus all we have to do to use expressions of type Generator<String> with string interpolation is to use a stateful IFormatProvider:

class GFormatProvider : IFormatProvider, ICustomFormatter
{
    private Context ctx;

    public GFormatProvider(Context ctx)
    {
        this.ctx = ctx;
    }

    public object GetFormat(System.Type formatType)
    {
        if (formatType == typeof(ICustomFormatter)) return this;
        return null;
    }

    public string Format(string format, object arg, IFormatProvider formatProvider)
    {
        if (arg == null)
            return string.Empty;

        // This is why we need the covariant type variable
        if (arg is IGenerator<object>)
        {
            arg = ((IGenerator<object>)arg).Run(ctx).Item1;
        }

        if(arg is IFormattable)
        {
            return ((IFormattable)arg).ToString(format, formatProvider);
        }
        else
        {
            return arg.ToString();
        }
    }
}

The provider is instantiated with a Context (the modified Context is dropped on purpose, but it could be returned if required). The Format method is executed for every placeholder, and it speaks for itself. It's worthwhile to note though that this is where we exploit that IGenerator<T> is covariant. Otherwise, an arbitrary IGenerator<T> couldn't be cast to IGenerator<object>.

The only missing piece now is the utility method in Generator<T> to hide the custom  IFormatProvider:

public class Generator<T> : IGenerator<T>
{

    ...

    public static Generator<String> Template(FormattableString formattable)
    {
        return Wrap(delegate (Context ctx)
        {
            var provider = new GFormatProvider(ctx);
            var res = formattable.ToString(provider);
            return CovariantTuple<String, Context>.Create(res, ctx);
        });
    }
}

Conclusion


With this straightforward extension to string interpolation, our prototype monadic generator/templating library is ready for experimentation. I hope its complexity does not hide the beauty of this monadic approach.

And finally, as usual, the complete source code can be downloaded from the following GitHub repository: https://github.com/tallcomponents/MonadicTemplate.

Monday, November 9, 2015

Basics of PDF graphics and how to edit

A question that I often hear is: "how can I change the graphics of my PDF such as replacing text with some other text or replace a logo with another logo?" In general this is not a good idea. PDF is not designed for editing; it is designed as an end format much like ink on paper. Nevertheless there may be circumstances - such as when you don't have access to the source format - when editing a PDF is a requirement. This article explains the basics of PDF graphics and how graphics can be edited if you really have to.

PDF Graphics

A PDF document contains various types of information such as metadata (author, title, etc.), form fields, navigational data such as bookmarks, annotations such as comments, and last but not least graphics. Graphics can roughly be divided into three categories: curves, text and images. The graphics on a PDF page are described by a sequence of operators. Operators can be divided into 3 groups: 
  1. draw operators that draw curves, text and images
  2. graphics state operators that do things such as selecting a font, selecting a color or transforming the coordinate system (more about that later)
  3. marked-content operators that associate high-level information with graphics but do not affect the appearance. I ignore this group for the purpose of this article.
Each operator takes zero or more operands. Here is a simple example that draws a straight red line:

150 250 m      % set the current point to (150, 250)
150 350 l      % append a straight line to (150, 350)
1 0 0 RG       % set the stroke color to red
S              % stroke the line

The operator follows the operands that are used by the operator. On the first line, operator 'm' uses operands 150 and 250. 

Here is an example that involves text:

/F1 24 Tf        % set font to F1 and font size to 24
100 100 Td       % move text position to (100, 100)
(Hello World) Tj % draw the text 'Hello World'

On the first line, operator Tf takes operands /F1 and 24. Operand /F1 is the name of a font. Without going into the full details, it suffices to say that /F1 can be resolved to an actual font either inside or outside the PDF document. 

And finally, here is a one-liner that draws an image:

/I1 Do           % draw image

Similar to selecting a font, /I1 is resolved to an actual image inside the PDF document.

Graphics state

As said, operators can be divided into draw operators and graphics state operators. While the operators are processed from top to bottom, a graphics state is maintained. The graphics state operators change the graphics state, and the result of each draw operator is affected by the graphics state. In the first sample we saw that the RG operator changes the stroke color to red and the S operator strokes the line using the current stroke color.

Other graphics state operators set the line width, dash pattern, fill color, font size etc. Finally, there are two special operators that, respectively, save (q) and restore (Q) the graphics state. Simply put: the restore operator changes the graphics state back to the state at the previous save operator. They appear pair-wise and can be nested.
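
The save and restore behavior maps naturally onto a stack. The following sketch models a tiny graphics state with just two properties; the class names are invented for this illustration and a real interpreter tracks many more properties.

```csharp
using System;
using System.Collections.Generic;

// A very small subset of the PDF graphics state, with the PDF defaults.
public class GraphicsState
{
    public string StrokeColor = "black";
    public double LineWidth = 1.0;

    public GraphicsState Clone()
    {
        return (GraphicsState)MemberwiseClone();
    }
}

public class GraphicsStateStack
{
    private readonly Stack<GraphicsState> saved = new Stack<GraphicsState>();

    public GraphicsState Current = new GraphicsState();

    // q: push a copy of the current state
    public void Save() { saved.Push(Current.Clone()); }

    // Q: go back to the state at the matching q
    public void Restore() { Current = saved.Pop(); }
}

public static class GraphicsStateSketch
{
    public static void Main()
    {
        var gs = new GraphicsStateStack();

        gs.Save();                                  // q
        gs.Current.StrokeColor = "red";             // 1 0 0 RG
        Console.WriteLine(gs.Current.StrokeColor);  // red
        gs.Restore();                               // Q
        Console.WriteLine(gs.Current.StrokeColor);  // black
    }
}
```

Because each q pushes and each Q pops, nested pairs work without any extra bookkeeping.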

Coordinate System

A crucial part of the PDF imaging model is the coordinate system. The coordinate system determines where on the page a given coordinate such as (150, 250) is located and how large a given size is. PDF defines different coordinate systems. The most important two are user space and device space.

Device space

The device space is determined by the output device such as a printer or display on which a PDF page is ultimately rendered. Let's say that we want to render a PDF page to a Windows bitmap at 300 DPI, then from a Windows development perspective, the device space has its origin at the top-left corner, the x axis points to the right, the y axis points downwards, and the length of a unit (a pixel) is 1/300 inch.

User space

As opposed to the device space, the user space is device independent. For every page, it is initialized such that its origin lies at the bottom-left corner, the x axis points to the right, the y axis points upwards and the length of a unit is 1/72 inch or 1 point. The coordinates in the above PDF operator examples are in user space.

Mapping user space to device space

How these coordinates are mapped or transformed to coordinates in the device space is defined by the current transformation matrix or CTM. Let's see how this would look in code:
// width and height of a Letter page
float width = 612; // 612 points = 8.5 inches
float height = 792; // 792 points = 11 inches

// output device is a 600 dpi bitmap
float dpi = 600;
Bitmap bitmap = new Bitmap((int)(width * dpi / 72), (int)(height * dpi / 72));

// 4 page corners in user space
PointF[] points = new PointF[] { 
   new PointF(0, 0),           // bottom-left corner
   new PointF(0, height),      // top-left corner
   new PointF(width, height),  // top-right corner
   new PointF(width, 0)        // bottom-right corner
};
Console.WriteLine(
   string.Join("; ", points.Select(p => string.Format("({0}, {1})", p.X, p.Y))));

// calculate the coordinates of the corners in device space
Matrix ctm = new Matrix();
// flip vertical axis
ctm.Scale(1, -1);
ctm.Translate(0, -bitmap.Height);
// resolution
ctm.Scale(dpi / 72f, dpi / 72);
ctm.TransformPoints(points);
Console.WriteLine(
   string.Join("; ", points.Select(p => string.Format("({0}, {1})", (int)p.X, (int)p.Y))));

Changing the user space

The CTM is part of the graphics state and it can be changed using the cm operator. The cm operator takes six operands that represent a transformation matrix. Changing the CTM will affect subsequent draw operators as you will see in the following example.

We have a page that measures 200 pt by 200 pt. The following image shows the empty page with the user space coordinate system laid on top of it:

We draw a red square measuring 50 by 50 and a smaller blue square measuring 25 by 25 inside the red square like this:

Next we transform the user space by translating it by (50, 75). Note that this is done before the figure is drawn.

Finally, the user space is rotated 30 degrees like this:



So instead of transforming the squares, we transform the user space and then draw the squares inside that user space. Depending on where you are coming from, this may feel counter-intuitive.
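
The arithmetic behind the translation and rotation can be sketched with a small matrix type. This assumes the usual PDF convention that the matrix passed to cm is applied before the existing CTM, and that a matrix [a b c d e f] maps (x, y) to (ax + cy + e, bx + dy + f). The Matrix2D type and the Cm helper are written for this example only.

```csharp
using System;
using System.Globalization;

public struct Matrix2D
{
    public readonly double A, B, C, D, E, F;

    public Matrix2D(double a, double b, double c, double d, double e, double f)
    {
        A = a; B = b; C = c; D = d; E = e; F = f;
    }

    public static Matrix2D Translate(double tx, double ty)
    {
        return new Matrix2D(1, 0, 0, 1, tx, ty);
    }

    public static Matrix2D Rotate(double degrees)
    {
        double r = degrees * Math.PI / 180;
        return new Matrix2D(Math.Cos(r), Math.Sin(r), -Math.Sin(r), Math.Cos(r), 0, 0);
    }

    // Returns the matrix that applies this matrix first and then 'next'.
    public Matrix2D Then(Matrix2D next)
    {
        return new Matrix2D(
            A * next.A + B * next.C, A * next.B + B * next.D,
            C * next.A + D * next.C, C * next.B + D * next.D,
            E * next.A + F * next.C + next.E, E * next.B + F * next.D + next.F);
    }

    public double MapX(double x, double y) { return A * x + C * y + E; }
    public double MapY(double x, double y) { return B * x + D * y + F; }
}

public static class CmSketch
{
    // Models the cm operator: the operand matrix comes before the old CTM.
    public static Matrix2D Cm(Matrix2D ctm, Matrix2D operand)
    {
        return operand.Then(ctm);
    }

    public static void Main()
    {
        Matrix2D ctm = new Matrix2D(1, 0, 0, 1, 0, 0);   // identity
        ctm = Cm(ctm, Matrix2D.Translate(50, 75));       // 1 0 0 1 50 75 cm
        ctm = Cm(ctm, Matrix2D.Rotate(30));              // rotate 30 degrees

        // Where does the corner (50, 0) of a 50 by 50 square end up?
        Console.WriteLine(string.Format(CultureInfo.InvariantCulture,
            "({0:F2}, {1:F2})", ctm.MapX(50, 0), ctm.MapY(50, 0))); // (93.30, 100.00)
    }
}
```

The origin of the transformed user space ends up at (50, 75), and every point drawn afterwards is first rotated around that origin and then shifted with it.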

Shapes

From a development point of view, a sequence of operators is not a convenient format. E.g. you cannot easily navigate to an image on the page and retrieve its position. Its properties depend on the accumulation of all previous operators, so you would have to process all of them first. The same is true for text and curves.

Changing a graphic, such as moving a single image or rotating a piece of text would be even harder because you would have to insert operators in such a way that they would only affect the targeted graphic.

PDFKit.NET allows you to extract all graphics on a page as a collection of shape objects. Internally it will do all the hard work of interpreting the operators, creating shape objects from draw operators and assigning properties that reflect the current graphics state. After extracting the shapes, you can remove shapes, insert new shapes and change their respective properties. When done, you can write the shapes back to a PDF page. This will in turn generate the required sequence of operators and operands.

Example: Replace a logo

To demonstrate the use of shapes to edit graphics, we are going to replace a logo. See below the images of the original PDF and the PDF after replacing the logo:

Here is all the code:

static void Main(string[] args)
{
   using (FileStream fileIn = new FileStream(
      "indesign_shortcuts.pdf", FileMode.Open, FileAccess.Read))
   {
      Document pdfIn = new Document(fileIn);
      Document pdfOut = new Document();

      foreach (Page page in pdfIn.Pages)
      {
         ShapeCollection shapes = page.CreateShapes();
         replaceLogo(shapes);

         // add modified shapes to the new document
         Page newPage = new Page(page.Width, page.Height);
         newPage.Overlay.Add(shapes);
         pdfOut.Pages.Add(newPage);
      }

      using (FileStream fileOut = new FileStream(
         "out.pdf", FileMode.Create, FileAccess.Write))
      {
         pdfOut.Write(fileOut);
      }
   }
}

static void replaceLogo(ShapeCollection shapes)
{
   for (int i = 0; i < shapes.Count; i++)
   {
      Shape shape = shapes[i];
   
      if (shape is ShapeCollection)
      {
         // recurse
         replaceLogo(shape as ShapeCollection);
      }
      else if (shape is ImageShape)
      {
         ImageShape oldLogo = shape as ImageShape;
         shapes.RemoveAt(i);

         ImageShape newLogo = new ImageShape("new-logo.png");
         newLogo.Transform = oldLogo.Transform;
         newLogo.Width = oldLogo.Width;
         newLogo.Height = oldLogo.Height;

         shapes.Insert(i, newLogo); 
      }
   }
}

Friday, November 6, 2015

Single Page ASP.NET Application for Splitting and Stitching PDF Documents

Download source code


This post shows a single page ASP.NET application that allows the user to:
  • Upload PDF documents
  • Drag and drop pages between PDF documents and assemble new documents
  • Download the modified PDF documents

The following technologies are used:

How the application works

Before going into the implementation details, let's take a look at the application itself:

Now that we have seen what the application does, let's take a detailed look at the code. The full project can be downloaded from here. In the text I will include code snippets and refer to the location of the source file in the project. Sometimes I will simplify the snippet for readability.

Design overview

The application is implemented as a single controller (/Controllers/HomeController.cs), a home view (/Views/Home/Index.cshtml) and a partial view (/Views/Home/_Panel.cshtml). Roughly, the application implements the following functionality:
  • Uploading a PDF and rendering the PDF as a list of pages
  • Dragging and dropping pages between documents
  • Downloading a new PDF document
The following diagram and pseudo-code show the steps involved in uploading a PDF document and rendering the page thumbnails.


After uploading, each page is rendered as an img element that encodes its origin using data-guid and data-index attributes. When pages are dropped to another panel, this panel includes all information required to ask the server to create a new PDF document from the guid/index pairs. This is shown in the following diagram.


In the remainder of the article I will discuss all parts in detail by taking a look at the server and the client code.

Upload


For uploading, we use the jQuery File Upload Plugin. The HTML for the upload button can be found in /Views/Home/Index.cshtml and looks like this:
<span class="btn btn-success fileinput-button">
   <i class="glyphicon glyphicon-plus"></i>
   <span>Upload PDF...</span>
   <!-- The file input field used as target for the file upload widget -->
   <input id="fileupload" type="file" name="files[]" multiple>
</span>
Here is the client-side JavaScript event handler of the upload button:
$('#fileupload').fileupload({
   url: '@Url.Action("Upload", "Home")',
   dataType: 'html',
   sequentialUploads: true,
   done: function (e, data) {
      addPanel(data.result);
   }
});
The url argument points to the Upload action of the Home controller which can be found in /Controllers/HomeController.cs. This action method is quite simple. It just saves the uploaded PDF in an upload folder (using a new guid for the file name) and returns a partial view that displays the panel with the page thumbnails:
[HttpPost]
public ActionResult Upload()
{
   HttpPostedFileBase file = Request.Files[0];
   Guid guid = Guid.NewGuid();
   kit.Document pdf = new kit.Document(file.InputStream);
   file.SaveAs(Server.MapPath(string.Format("~/Upload/{0}.pdf", guid)));
   
   return PartialView("_Panel", new PanelModel() { 
      DocumentGuid = guid.ToString(), 
      Document = pdf 
   });
}
The HTML of the returned view (discussed next) is available on the client side through the done callback of the fileupload function. This function passes it to the helper function addPanel that looks like this:
function addPanel(html) {
  var id = guid();
  // prepend html to div with id 'panels' (with a slide effect)
  $(html)
    .hide()
    .prependTo('#panels')
    .slideDown()
    .attr('id', id);
  // make the list of pages draggable
  $(".pageslist").sortable({
    connectWith: ".pageslist",
    stop: function (event, ui) {
      updatePanels();
    }
  }).disableSelection();
  
  updatePanels();
  updateToolbar();
}
It first prepends the HTML to the div with id 'panels' (with a slide effect). Next, it makes the list of pages draggable (discussed later).
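The guid helper that addPanel calls is part of the full source and is not shown here. As a hypothetical stand-in (not the author's implementation), a function along these lines would produce a suitably unique element id:

```javascript
// Hypothetical stand-in for the guid() helper used by addPanel.
// Math.random() is not cryptographically strong, which is acceptable
// for generating element ids on a single page.
function guid() {
  return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function (c) {
    var r = Math.floor(Math.random() * 16);
    // 'x' becomes any hex digit, 'y' becomes one of 8, 9, a or b
    var v = c === 'x' ? r : (r & 0x3) | 0x8;
    return v.toString(16);
  });
}

console.log(guid());
```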

Display the PDF document using a partial view



The Upload method returns the partial view _Panel (see /Views/Home/_Panel.cshtml), which is rendered from a PanelModel. The PanelModel is nothing more than a guid/document tuple. The document is an instance of the PDFKit.NET Document class. This third-party library is used to programmatically combine pages from documents to create a new document. It is important to note that the Document instance only lives during the rendering of the partial view; it is discarded thereafter. The file name (the guid) is stored as an attribute in the HTML so that the PDF document can be loaded when needed. There is no in-memory server state whatsoever.
Here is the code of partial view _Panel:
@model PanelModel
<div class="panel">
  <div class="panelheader">
    <a href="#" class="closepanel pull-right"><i class=" fa fa-close"></i></a>
    <button type="submit" class="download btn btn-xs btn-primary">Download</button>
  </div>
  <div class="pagesarea">
    <ul class="pageslist">
    @for (int i = 0; Model.Document != null && i < Model.Document.Pages.Count; i++)
    {
      // enumerate the pages of the document
      Page page = Model.Document.Pages[i];
      // calculate the width and height of the thumbnail (18 dpi)
      int width = (int)((PanelModel.THUMBRES / 72f) * page.Width);
      int height = (int)((PanelModel.THUMBRES / 72f) * page.Height);
      // the src points to the Thumbnail action
      // attributes data-guid and data-index store the 
      // page origin for later retrieval
      <li class="ui-state-default">
        <img class="pagethumbnail" src="/Home/Thumbnail?d=@Model.DocumentGuid&i=@i" 
          width="@width" height="@height" data-guid="@Model.DocumentGuid" data-index="@i" />
      </li>
    }
    </ul>
  </div>
</div>
The PDF panel uses the Bootstrap classes panel and panelheader. The panel header has a download button (discussed later) and a close button; for the close button, I refer to the full source code.
The body of the panel is an unordered list. The list items are the thumbnail images for the different pages. The following CSS styles the unordered list so there is no bullet and the items run from left to right:
ul.pageslist { 
    list-style-type: none; 
    float: left;
    margin: 0; 
    padding: 0; 
    width: 100%;
    min-height: 100px;
}
The object model of PDFKit.NET is used to enumerate the pages of the document. The interesting parts are the src, data-guid and data-index attributes of the img element. The src attribute points to a Thumbnail action method that dynamically renders the page thumbnail. The URL includes the guid of the document and the page index. The Thumbnail action is discussed next.
The data-guid and data-index attributes of the img element make the page (thumbnail) self-describing. When the page is dragged (discussed later) to another panel, the data-guid and data-index are included and continue to identify the original document and page. This way, all state is maintained in the browser. How this state is used to download the new document is explained shortly.

Render page thumbnails

The src attribute of the page thumbnail image points to the following action method (see /Controllers/HomeController.cs):
public ActionResult Thumbnail(string d, int i)
{
  // open the file in the upload folder identified by d
  using (FileStream file = new FileStream(
    Server.MapPath(string.Format("~/Upload/{0}.pdf", d)), 
    FileMode.Open, FileAccess.Read))
  {
    // construct a PDFRasterizer.NET document
    // and get the page at index i
    Document pdf = new Document(file);
    Page page = pdf.Pages[i];
    float resolution = PanelModel.THUMBRES;
    float scale = resolution / 72f;
    int bmpWidth = (int)(scale * page.Width);
    int bmpHeight = (int)(scale * page.Height);
    // render the page to an 18 DPI PNG bitmap
    using (Bitmap bitmap = new Bitmap(bmpWidth, bmpHeight))
    using (Graphics graphics = Graphics.FromImage(bitmap))
    {
      graphics.ScaleTransform(scale, scale);
      page.Draw(graphics);
      // save the bitmap to the HTTP response
      bitmap.Save(Response.OutputStream, ImageFormat.Png);
    }
  }
  return null;
}
In this code, arguments d and i are, respectively, the file name and the index of the page to render. PDFRasterizer.NET is used to render this page to a GDI+ bitmap. This bitmap is then saved to the output stream as a PNG, which the browser displays as the thumbnail.
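The scale calculation is the same in the partial view and in the Thumbnail action: page sizes are expressed in points (1/72 inch), so multiplying by resolution / 72 gives pixels. The following sketch reproduces that arithmetic in JavaScript for an A4 page of approximately 595 x 842 points. The function name thumbnailSize is made up for illustration, and the resolution of 18 is taken from the code comments.

```javascript
// Convert a page size in points (1/72 inch) to pixels at a given resolution.
function thumbnailSize(widthPt, heightPt, dpi) {
  var scale = dpi / 72;
  return {
    // Math.trunc drops the fraction, like the (int) cast in the C# code
    width: Math.trunc(scale * widthPt),
    height: Math.trunc(scale * heightPt)
  };
}

var a4 = thumbnailSize(595, 842, 18);
console.log(a4.width + ' x ' + a4.height); // prints "148 x 210"
```

Because the view and the action truncate in the same way, the width and height attributes of the img element match the dimensions of the rendered bitmap.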

Drag and drop pages



For client-side drag and drop, we use the sortable interaction of jQuery UI. The list of pages per panel is made sortable inside the addPanel function that we saw earlier, using the following code:
function addPanel(data) {
  ...
  $(".pageslist").sortable({
    connectWith: ".pageslist",
    stop: function (event, ui) {
      updatePanels();
    }
  }).disableSelection();
  ...
}
When a page is dragged from one panel to another, no call is made to the server. All information required to download the new PDF document is stored client-side by means of the data-guid and data-index attributes of the thumbnail img elements.
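The idea that a dragged page keeps describing its own origin can be sketched without any DOM. In the model below, each panel is a plain array and each entry mirrors the data-guid and data-index attributes of one thumbnail. The movePage function and the sample guids are assumptions for illustration; in the application itself, jQuery UI moves the actual li elements.

```javascript
// Each panel is modelled as an array of { guid, index } entries,
// mirroring the data-guid/data-index attributes on the thumbnails.
function movePage(panels, fromPanel, fromPos, toPanel, toPos) {
  // remove the entry from the source panel ...
  var moved = panels[fromPanel].splice(fromPos, 1)[0];
  // ... and insert it, unchanged, into the target panel
  panels[toPanel].splice(toPos, 0, moved);
}

var panels = {
  a: [{ guid: 'doc-1', index: 0 }, { guid: 'doc-1', index: 1 }],
  b: [{ guid: 'doc-2', index: 0 }]
};

// drag the second page of panel a to the front of panel b
movePage(panels, 'a', 1, 'b', 0);
console.log(JSON.stringify(panels.b));
```

After the move, panel b lists page 1 of doc-1 followed by page 0 of doc-2. That ordered list of guid/index pairs is all the server needs to assemble the new document.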

Download new PDF document



Downloading the new document is the most interesting part. Before looking at the code, let me outline the 5 steps involved:
  1. CLIENT: The click handler of the download button enumerates the page thumbnails and creates an array of doc-guid/page-index tuples.
  2. CLIENT: The JSON representation of this array is POSTed to the Download action on the server.
  3. SERVER: Because a POST request cannot trigger the browser to open a file, it temporarily stores the JSON as a file and returns a guid identifying the JSON file.
  4. CLIENT: Next, the client makes a GET request to another Download action and passes the same guid.
  5. SERVER: The server reads the JSON from the temporary file, deletes the file, creates the PDF and writes it to HTTP response.
The click handler of the download button looks like this. Note that we use delegated events because download buttons are created dynamically. This handler covers all three client-side steps described above (steps 1, 2 and 4).
$(document).on('click', 'button.download', function () {
  // enumerate all the pages and create
  // an array of document guid/page index tuples
  var pages = [];
  var id = $(this).parents('div.panel').attr('id');
  $('#' + id).find('img.pagethumbnail').each(function () {
    pages.push({ "Guid": $(this).attr('data-guid'), "Index": $(this).attr('data-index') });
  });
  
  // POST array of tuples to download action
  $.ajax({
    type: "POST",
    url: '@Url.Action("Download", "Home")',
    data: JSON.stringify(pages),
    contentType: "application/json",
    dataType: "text",
    success: function (data) {
      // request the PDF
      var url = '@Url.Action("Download", "Home")' + '?id=' + data;
      window.location = url;
    }
  });
});
The server part consists of two action methods: One stores the JSON, the other creates and returns the PDF. This technique and alternatives are also discussed on StackOverflow. Here is the POST action:
[HttpPost]
public ActionResult Download(PanelPage[] pages)
{
  // create a JSON string from the PanelPage array
  string json = new JavaScriptSerializer().Serialize(pages);
  // save the JSON to ~/download/<newguid>.json
  string id = Guid.NewGuid().ToString();
  string path = Server.MapPath(string.Format("~/Download/{0}.json", id));
  System.IO.File.WriteAllText(path, json);
  // return the guid that identifies the JSON file
  return Content(id);
}
public class PanelPage
{
  public string Guid { get; set; }
  public int Index { get; set; }
}
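
The store-then-fetch handshake of steps 3 and 5 can be sketched independently of ASP.NET. The JavaScript below is an illustration only, not the GET action of the project: it keeps the page lists in an in-memory Map instead of temporary JSON files, and the names storeRequest and takeRequest are made up.

```javascript
// In-memory stand-in for the temporary JSON files (an assumption for brevity).
var pending = new Map();
var nextId = 0;

// Step 3: store the page list and hand back a token that identifies it.
function storeRequest(pages) {
  var id = 'req-' + (nextId++);
  pending.set(id, JSON.stringify(pages));
  return id;
}

// Step 5: look up the page list once, then discard it.
function takeRequest(id) {
  if (!pending.has(id)) {
    return null; // unknown token, or one that was already used
  }
  var pages = JSON.parse(pending.get(id));
  pending.delete(id);
  return pages;
}

var token = storeRequest([{ Guid: 'doc-1', Index: 1 }, { Guid: 'doc-2', Index: 0 }]);
console.log(takeRequest(token)); // the two stored pages
console.log(takeRequest(token)); // null, the token can be used only once
```

Deleting the entry on first use mirrors step 5, where the server deletes the temporary file after reading it, so stored requests do not accumulate.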

Show/hide download button

The download button at the top of a panel is only shown if the panel has at least one page; otherwise there is nothing to download. This is taken care of by the updatePanels function, which is called whenever a new panel is added or when a page is dragged from one panel to another:
function updatePanels() {
  $('.panel').each(function () {
    if ($(this).find('.pagethumbnail').length === 0) {
      $(this).find('button.download').hide();
    }
    else {
      $(this).find('button.download').show();
    }
  });
}