Convert PDF documents into HTML

How to convert PDF documents into HTML

Why convert PDF documents into HTML web pages?

Documents posted to your website fall under ‘web content’ and therefore must follow the Accessibility guidelines. The easiest way to comply with the guidelines surrounding documents is to convert them to an HTML (web page) equivalent.

Introduction

When you insert content from a PDF file into your web page, the formatting of the text can change once it is pasted into the web page editor. To minimize formatting inconsistencies when converting PDF documents to HyperText Markup Language (HTML), convert the PDF to rich text format before inserting content.

Converting a PDF document to rich text format

Note: This conversion may format text incorrectly once it is pasted into the WYSIWYG editor. It may insert extra spaces in sentences, capitalize letters that were previously in lower case, and insert paragraph breaks within paragraphs. Please ensure that the text ultimately pasted into the WYSIWYG editor on the WCMS is properly formatted. A good way to check is to read through the document once after pasting it into the editor, as all of these checks can be made in a single read-through.

  1. Open the PDF that you wish to add to your site in Adobe Acrobat.

    1. Right click on the PDF file.

    2. Select Edit with Adobe Acrobat.

  2. Save the file in rich text format.

    1. Select File in the Acrobat toolbar.

    2. In the dropdown, select Save As Other > More Options > Rich Text Format.

  3. Open the rich text format file using Microsoft Word.

  4. Use Ctrl + a to select all text in the document.

  5. Use Ctrl + c to copy the selection.

  6. Use Ctrl + v to paste copied content into the body field of your web page.

  7. Ensure that the content is properly formatted in the WYSIWYG editor.
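The problems mentioned in the note above (extra spaces in sentences and paragraph breaks within paragraphs) can also be flagged with a small script before the final read-through. The following is only a minimal sketch in Python: the function name `find_paste_problems` is hypothetical, it assumes you have copied the pasted text into a plain string, and it will also flag headings because they do not end in punctuation. It does not replace reading the page yourself.

```python
import re


def find_paste_problems(text: str) -> list[str]:
    """Return warnings for common artifacts in text pasted from an RTF file."""
    warnings = []
    # Treat every non-empty line as one paragraph (an assumption of this sketch).
    paragraphs = [line for line in text.split("\n") if line.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        stripped = paragraph.strip()
        # Two or more consecutive spaces between words.
        if re.search(r"\S {2,}\S", stripped):
            warnings.append(f"paragraph {number}: extra spaces")
        # A paragraph that does not end in closing punctuation may have
        # been split by a stray paragraph break.
        if not stripped.endswith((".", "!", "?", ":", '"', ")")):
            warnings.append(
                f"paragraph {number}: possible break inside a paragraph"
            )
    return warnings


if __name__ == "__main__":
    sample = "This sentence has  extra spaces.\nThis paragraph was cut\nin the middle."
    for warning in find_paste_problems(sample):
        print(warning)
```

Each warning points to a paragraph worth checking by hand in the WYSIWYG editor; a flagged paragraph is not necessarily wrong.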

Notes regarding formatting changes

While this method applies some appropriate formatting to the content, it does not automatically add every feature required for accessibility and usability. Follow these steps to ensure that pasted content is accessible and usable:

  • Align all headings and text to the left

    Any center, left, or right alignment attributes are kept when content is pasted in. It is recommended that you remove them by clicking the Source button and deleting every alignment attribute, for example align="center", align="left", or align="right". Alignment may also appear as a text-align declaration inside a style attribute.

  • Migrate the text in segments, preferably paragraph by paragraph or title by title.

    This helps prevent most formatting and spacing errors, unconverted text, and repeated headers and footers in the pages.

  • Whenever you paste content from any document (PDF, .rtf, .docx and others), ensure that the text content is accessible and usable.

    Not all characters convert properly when pasted into the WYSIWYG editor. Punctuation (periods, commas, question marks and exclamation marks), as well as symbols (%, $, —, etc.), may not appear on the web page. Any subscript and superscript formatting is also stripped.

  • Insert line breaks manually

    Because the .rtf file treats line breaks as pictures, they are not carried over when you paste, and you must insert them manually on your web page.

  • Insert title headings manually.

    Since titles can take many formats in .rtf and .pdf files, decide which heading level (h2, h3, h4, etc.) is appropriate for each title and apply it on the web page.

  • Insert text in frames/tables separately, or turn it into a picture with appropriate alt text.

    Since text in frames and tables cannot always be pasted successfully into the web page from .rtf and PDF files, insert a table using the WYSIWYG toolbar. Alternatively, you can use a snipping tool to capture the text as an image and insert it. Remember to add appropriate alternative text or a caption.
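Two of the checks above, removing alignment attributes and confirming that punctuation and symbols survived the paste, are repetitive enough to sketch as a script. The example below is a rough illustration in Python, not part of the WCMS: the function names `strip_alignment` and `missing_symbols` are hypothetical, and it assumes that alignment appears either as an align attribute or as a text-align declaration in a style attribute. The pattern matching is deliberately simple and could also match look-alike text in the body of a page, so review the result in the source view before saving.

```python
import re
from collections import Counter

# Matches an attribute such as align="center", align='left' or align=right.
ALIGN_ATTR = re.compile(
    r"""\s+align\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE
)
# Matches a CSS declaration such as "text-align: center;".
TEXT_ALIGN_DECL = re.compile(r"""text-align\s*:\s*[^;"']+;?\s*""", re.IGNORECASE)
# Matches a style attribute left empty after its declaration was removed.
EMPTY_STYLE = re.compile(r"""\s+style\s*=\s*(?:""|'')""", re.IGNORECASE)


def strip_alignment(html: str) -> str:
    """Remove alignment attributes and text-align declarations from HTML source."""
    html = ALIGN_ATTR.sub("", html)
    html = TEXT_ALIGN_DECL.sub("", html)
    return EMPTY_STYLE.sub("", html)


def missing_symbols(source_text: str, pasted_text: str,
                    symbols: str = ".,?!%$\u2014") -> dict:
    """Count symbols that occur more often in the source than in the pasted text.

    "\u2014" is the long dash listed among the symbols above.
    """
    source_counts = Counter(ch for ch in source_text if ch in symbols)
    pasted_counts = Counter(ch for ch in pasted_text if ch in symbols)
    # Counter subtraction keeps only the positive differences.
    return dict(source_counts - pasted_counts)


if __name__ == "__main__":
    print(strip_alignment('<p align="center">Title</p>'))
    print(missing_symbols("Fees rose 5%, to $20.", "Fees rose 5 to 20"))
```

An empty result from `missing_symbols` only means the counts match; it does not prove that every symbol is in the right place, so a read-through is still needed.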