HTMLDOC

HTMLDOC is a free, high-quality HTML to PDF converter. The only drawback is that it doesn't support CSS in its current version (you can still achieve good results!). The big advantage is that you don't need to install anything else (no Ghostscript, for example).
To use htmldoc in your wiki do the following:

With IE6 you get a “file not found” error if you try to open the PDF generated by DokuWiki with the default PDF reader (although you can still save the file to your local disk). You can fix this by appending this line to the code above:

header('Cache-control: private, must-revalidate');

Since DokuWiki is based on UTF-8 and the PDFs don't display the umlauts correctly, the last function can be replaced with this one:

function umlaute($text){
  // translate the UTF-8 encoded umlauts back to their Latin-1 equivalents
  return strtr($text,array(
      utf8_encode("ß") => "ß",
      utf8_encode("ä") => "ä",
      utf8_encode("ü") => "ü",
      utf8_encode("ö") => "ö",
      utf8_encode("Ä") => "Ä",
      utf8_encode("Ü") => "Ü",
      utf8_encode("Ö") => "Ö"));
}
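On newer PHP versions, where utf8_encode() is deprecated (since PHP 8.2), the whole conversion can be sketched with mb_convert_encoding(); the function name umlaute_mb is made up here, and the mbstring extension is assumed:

```php
<?php
// Sketch, not the original code: convert the whole UTF-8 string to
// ISO-8859-1 in one call instead of translating each umlaut separately.
// Assumes the mbstring extension is loaded.
function umlaute_mb(string $text): string {
    return mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
}
```

Note that this converts every UTF-8 character that exists in Latin-1, not just the seven umlauts listed above.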

HTML->PS->PDF

These are some of the steps and modifications that were previously at bobbaddeley.com.


For inline image support and internal DokuWiki references to work in the PDF, modify the html2ps command:

$docbase="/var/www/html/";
$urlbase="http://my.dokuwiki.host/dokuwiki";
$command1="/usr/bin/html2ps -b " . $urlbase . " -r " . $docbase . " -o " . $filenameTemp . " " . $filenameInput;

The $docbase and $urlbase paths could probably be read from a DokuWiki configuration variable instead of being hard-coded.
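One possibility, sketched under the assumption that this code runs inside DokuWiki (where the DOKU_INC and DOKU_URL constants are defined at startup), is:

```php
<?php
// Sketch: take the two paths from DokuWiki's own constants instead of
// hard-coding them. Inside DokuWiki both constants are already defined;
// the define() fallbacks and placeholder filenames are for illustration.
if (!defined('DOKU_INC')) define('DOKU_INC', '/var/www/html/');
if (!defined('DOKU_URL')) define('DOKU_URL', 'http://my.dokuwiki.host/dokuwiki/');

$filenameTemp  = '/tmp/output.ps';   // placeholder
$filenameInput = '/tmp/input.html';  // placeholder

$docbase  = DOKU_INC;
$urlbase  = rtrim(DOKU_URL, '/');
$command1 = "/usr/bin/html2ps -b " . $urlbase . " -r " . $docbase
          . " -o " . $filenameTemp . " " . $filenameInput;
```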

Discussion

Great work, thanks. But in order to get filenames based on the page name with HTMLDOC, I had to change the line inserted in lib/tpl/template/main.php:

<?php print html_btn('exportpdf',$ID,'',array('do' => 'export_pdf', 'id' => $ID)) ?>  <!-- inserted line -->

I also had to remove the space after wikiexport in inc/common.php:

header("Content-Disposition: attachment; filename=wikiexport" . str_replace(':','_',$_GET["id"]) . ".pdf");

To retrieve images from the wiki server (relative links; I hope this won't cause security issues). I had problems with PNG files, so I converted them into JPEG format:

$text = preg_replace("'<img src=\"/(.*?)/lib/exe/fetch.php(.*?)media=(.*?)\"(.*?)>'si","<img src=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1/data/media/\\3\">", $text); # for uploaded images
$text = preg_replace("'<img src=\"/(.*?)\"'si","<img src=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1\"", $text); # for built-in images, smileys for example
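The first expression can be sanity-checked on a one-line sample (the host and media names here are made up for the demo):

```php
<?php
// Demo of the uploaded-images rewrite above, with a fake server name.
$_SERVER['SERVER_NAME'] = 'my.dokuwiki.host';
$text = '<img src="/dokuwiki/lib/exe/fetch.php?w=200&media=wiki:logo.png" alt="">';
$text = preg_replace(
    "'<img src=\"/(.*?)/lib/exe/fetch.php(.*?)media=(.*?)\"(.*?)>'si",
    "<img src=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1/data/media/\\3\">",
    $text
);
```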

The generated code for the table of contents contains newlines, so:

$text = preg_replace("'<div class=\"toc\">.<div class=\"tocheader\">.*?</div>.</div>'si",'',$text );

For the umlaute function (French support and image link support):

function umlaute($text){
  return strtr($text,array(
      "ß"=>"&szlig;",
      "ä"=>"&auml;",
      "ë"=>"&euml;",
      "ï"=>"&iuml;",
      "ü"=>"&uuml;",
      "ö"=>"&ouml;",
      "Ä"=>"&Auml;",
      "Ë"=>"&Euml;",
      "Ï"=>"&Iuml;",
      "Ü"=>"&Uuml;",
      "Ö"=>"&Ouml;",
      "â"=>"&acirc;",
      "ê"=>"&ecirc;",
      "î"=>"&icirc;",
      "ô"=>"&ocirc;",
      "û"=>"&ucirc;",
      "Â"=>"&Acirc;",
      "Ê"=>"&Ecirc;",
      "Î"=>"&Icirc;",
      "Ô"=>"&Ocirc;",
      "Û"=>"&Ucirc;",
      "à"=>"&agrave;",
      "è"=>"&egrave;",
      "ù"=>"&ugrave;",
      "é"=>"&eacute;",
      "À"=>"&Agrave;",
      "È"=>"&Egrave;",
      "Ù"=>"&Ugrave;",
      "É"=>"&Eacute;",
      "ç"=>"&ccedil;",
      "%3A"=>"/"
));
}
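An alternative sketch (the name umlaute_alt is made up): instead of maintaining the table by hand, every non-ASCII character can be turned into its named HTML entity, which HTMLDOC understands, while ASCII characters, and therefore the HTML tags, pass through untouched; only the extra "%3A" => "/" rule keeps its own replacement. This assumes the page buffer is UTF-8:

```php
<?php
// Sketch: convert every non-ASCII UTF-8 character to its HTML entity.
function umlaute_alt(string $text): string {
    $text = preg_replace_callback('/[\x80-\x{10FFFF}]/u', function ($m) {
        // ä -> &auml;, é -> &eacute;, ...
        return htmlentities($m[0], ENT_NOQUOTES, 'UTF-8');
    }, $text);
    // keep the original function's extra rule for encoded colons in links
    return str_replace('%3A', '/', $text);
}
```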

Virgile Gerecke 2005-08-19 15:00

Why doesn't the background come out right in the PDF? I mean for <code> sections?

After a bunch of experimentation with the various techniques noted above, I found that the best way to get PDF versions of DokuWiki pages is simply to print to PDF. On Linux and Mac that should be included with your operating system, and Windows users can use PrimoPDF.

Hi,
with htmldoc-1.8.24 I had to modify inc/common.php:

 $command = "/usr/bin/htmldoc  --webpage --no-title -f " . $filenameOutput . " " . $filenameInput;

An HTMLDOC variant

Hi, my main problem was the following:

So, I expanded the first HTMLDOC conversion code with some usable features:

:!: You need to use the same 'pdfcp' code-page string and @code-page string. Comments can be written in two ways: with // or # delimiters.
You can use multiple @code-page sections in this file.

So, you can get better, nicer, more varied and more portable documents with these modifications.
Cheers, — Peter Szládovics 2005-11-17 20:09


HTMLDOC Variant Modifications

I couldn't get PDF export to handle images and links properly until I replaced the following lines in inc/common.php (from Peter's code shown immediately above):

  $text = preg_replace("'src=\"http://.*media='i",'src="'.DOKU_INC.'data/media/',$text); # Change links to path of images
  $text = preg_replace("'href=\"http://.*media=.*class=\"media\"'i",'href="" class="media"',$text); # remove picture link address

with these lines:

  $text = preg_replace("'<img src=\"/(.*?)/lib/exe/fetch.php(.*?)media=(.*?)\"(.*?)>'si","<img src=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1/lib/exe/fetch.php?media=\\3\">", $text); # for uploaded images
  $text = preg_replace("'<img src=\"/(.*?)>'si","<img src=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1>", $text); # for built-in images, smileys for example
  $text = preg_replace("'href=\"/(.*?)>'si","href=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1>", $text); # for internal links (the capture already includes the closing quote)

I also changed the temporary filenames so that static names were not used by changing (also in Peter's modifications to inc/common.php)

  $filenameInput=$dir."input.html";
  $filenameOutput=$dir."output.pdf";

to this:

  $filenameInput=tempnam($dir,"input_");
  $filenameOutput=tempnam($dir,"output_");

The PDF export is a great and very essential feature. Thanks! :-) — Brian Dundon 2006-3-18 22:44

Remark on HTMLDOC Variant Modifications

I used the code and I get a PDF file, but when I try to open it, I get “The file is damaged and could not be repaired”. The command-line parameters are:

htmldoc -t pdf12 --browserwidth 1280 --jpeg=0 --charset iso-8859-15 --no-title --embedfonts --toctitle "Inhaltsverzeichnis" -f /tmp/output_FSJlGa /tmp/input_GM1H6V

I tried with -t pdf14 and --charset windows-1252 too, with the same result. I use htmldoc-1.9.x-r1484. – Werner 2006-05-13 – P.S.: with HTMLDOC version 1.8.26 everything works fine. It may be an HTMLDOC issue, since even the htmldoc.pdf generated via make is unreadable.

Remark 2 on HTMLDOC Variant Modifications

The updated code (from Brian Dundon) above uses URLs to find the images. This variant uses full paths on the Unix/Linux system to find the images. URLs won't work when logins are required for DokuWiki.

Note: the line for changing the path of built-in images still needs a closer look.

  $text = preg_replace("'src=\"http://.*media='i",'src="'.DOKU_INC.'data/media/',$text); # Change links to path of images
  $text = preg_replace("'href=\"http://.*media=.*class=\"media\"'mi",'href="" class="media"',$text); # remove picture link address
  $textarr = preg_split("/\n/",$text);
 
# Find and change linked images
  $linkeds = preg_grep("'<a href=.*<img src=.* /></a>'i",$textarr);
  foreach ( $linkeds as $linked ) {
    $picture = preg_replace("/<a href=.*\">/i",'',$linked);
    $picture = preg_replace("'</a>'i",'',$picture);
    $found = "'".preg_quote($linked)."'";
    $text = preg_replace($found,$picture,$text);
  }

with

  # find all user uploaded images (and make sure this code doesn't get treated as an image!)
  preg_match_all ( '/media=([\./a-z].*?)"/mi', $text, $matches);
  # only use unique elements from array with matches following first parenthesis
  foreach (array_unique($matches[1]) as $match){
    # change namespace into directory syntax
    $newimg = preg_replace ('/:/', '/', $match );
    $text = preg_replace ( "/media=$match/m", "media=$newimg", $text );
  }
 
  $text = preg_replace('|<img src=".*?/lib/exe/fetch.php.*?media=(.*?)".*?>|mi',"<img src=\"" . DOKU_INC . "data/media/\\1\" />",$text); # for images
  $text = preg_replace('|<img src=".*?/(lib/images/.*?)".*?>|mi',"<img src=\"" . DOKU_INC . "\\1\" />",$text); # for internal images
  $text = preg_replace("|href=\"/(.*?)>|mi","href=\"http://" . $_SERVER['SERVER_NAME'] . "/\\1>", $text); # for internal links (the capture already includes the closing quote)
#you can put this in or leave it out...  $text = preg_replace("|href=\".*?media=.*?class=\"media\"|i",'href="" class="media"',$text); # remove picture link address
  $textarr = preg_split("/\n/",$text);
 
# Find and change linked images
  $linkeds = preg_grep("|<a href=.*?<img src=.*? /></a>|i",$textarr);
  foreach ( $linkeds as $linked ) {
    $picture = preg_replace("|<a href=.*?\">|i",'',$linked);
    $picture = preg_replace("|</a>|i",'',$picture);
    $found = "'".preg_quote($linked,"'")."'";
    $text = preg_replace($found,$picture,$text);
  }
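The effect of the “find and change linked images” loop can be checked on a single synthetic line (the URLs are made up; preg_quote() is given the pattern delimiter as its second argument so that delimiter characters in the matched line are escaped too):

```php
<?php
// Demo of the linked-image unwrapping above on one synthetic line.
$text = '<a href="/dokuwiki/doku.php?id=foo"><img src="/img/a.png" /></a>';
$textarr = preg_split("/\n/", $text);
$linkeds = preg_grep("|<a href=.*?<img src=.*? /></a>|i", $textarr);
foreach ($linkeds as $linked) {
    $picture = preg_replace("|<a href=.*?\">|i", '', $linked);
    $picture = preg_replace("|</a>|i", '', $picture);
    $found = "'" . preg_quote($linked, "'") . "'";
    $text = preg_replace($found, $picture, $text);
}
// $text is now just the bare <img> tag
```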

Little remark:
Make sure conf/replace.conf is called conf/replaces.conf (with an 's').

Other little update:
swap

system("exit(0)");

with

system("exit 0");

The exit(0) generated an error in the apache error.log file.

PDFexport is very handy for dokuwiki. Thanx for the code. Hope this remark helps others as well. — F. Masselink 2006-10-24 11:33

Remark of Remark 2 on HTMLDOC Variant Modifications

I had a problem with preg_match_all, solved by changing the following line. In inc/common.php, replace

preg_match_all ( '/media=([\./a-z].*?)"/mi', $text, $matches);

with

preg_match_all ( '/media=([\.\/a-z].*?)"/mi', $text, $matches);
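The reason is that '/' is also the pattern delimiter here, so the one inside the character class has to be escaped. The corrected pattern can be checked on a sample line (the media name is made up):

```php
<?php
// Demo of the corrected pattern: the escaped '/' in the character
// class no longer terminates the pattern early.
$text = '<img src="fetch.php?media=wiki:logo.png" />';
preg_match_all('/media=([\.\/a-z].*?)"/mi', $text, $matches);
// $matches[1] now holds the media name
```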

A.Chiapparini 2007-01-22 14:47

Modification for nice URLs

I'm using nice URLs with the rewrite method, and I had to add this line for internal pictures to work. It goes with the other preg_replace lines.

$text = preg_replace('|<img src=".*?/_media/(.*?)\?.*?".*?>|mi',"<img src=\"" . DOKU_INC . "data/media/\\1\" />",$text);

I also had trouble with the filename of the downloaded PDF; here is my modification of the header line at the end of the pdfmake function. It sends a filename containing the wiki name and only the name of the wiki page, not the namespace location.

header("Content-Disposition: attachment; filename=".str_replace(' ','_',$conf['title']).'-'.end(split('/',$_GET["id"])).".pdf");
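Since split() was removed in PHP 7, a sketch of the same filename logic with explode() (the sample title and id are made up):

```php
<?php
// Sketch: PHP 7+ version of the attachment filename above.
$conf['title'] = 'My Wiki';   // normally set by DokuWiki
$id = 'tips/pdfexport';       // normally taken from $_GET["id"]
$parts = explode('/', $id);
$filename = str_replace(' ', '_', $conf['title']) . '-' . end($parts) . '.pdf';
header("Content-Disposition: attachment; filename=" . $filename);
```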

HTMLDOC recursive variant

My problem was that I needed support for child-page export. I therefore chose to modify/hack the An_HTMLDOC_variant found on this page. Some of the remarks/improvements to An_HTMLDOC_variant have also been included.

It will thus perform a recursive export of your current page. This means that any internal links will be followed and converted to PDF too. The internal links are copied to the PDF, meaning that they are clickable just like they are in DokuWiki.

Remember that I have only tested this on my own server (on which it works), so expect bugs and/or strange behavior.

Bug fixes

Here follows a list of fixed bugs

Nicklas Overgaard 2009-10-31 16:45 GMT+1

HTMLDOC and OS X

The official HTMLDOC packages for OS X are not free. I did find another package at http://www.bluem.net/downloads/htmldoc_en/

In inc/common.php I had to change

$command = "/usr/bin/htmldoc --no-title -f " . $filenameOutput . " " . $filenameInput;

to

$command = "/usr/local/bin/htmldoc --webpage --outfile " . $filenameOutput . " " . $filenameInput;

David McCallum 2006-06-30 16:09


Installing HTMLDOC

Installing HTMLDOC should be pretty easy, as written above.
But sadly I don't know how to install it on the server (I only have upload permission via an FTP client).
I downloaded htmldoc-1.9.x-r1474 and uploaded the folder, but I can't find an installer package.

Does anyone have a simple step-by-step instruction for beginners?
THANK YOU

If you are using Debian or a derived operating system, you can use the following command:

# apt-get install htmldoc

Remark of Remark 3 on HTMLDOC Variant Modifications

Improvements:

  1. no need for a writable directory: the system temporary directory is used automatically
  2. unique filenames are used, so multiple people can export files without problems

if you replace

  $dir=DOKU_INC."tmp/";
  $filenameInput=$dir."input_";
  $filenameOutput=$dir."output_";

with

  $filenameInput=tempnam('','html');
  $filenameOutput=tempnam('','pdf');

HTMLDOC request

I think it would be very useful if you could create a page with a list of wiki pages to export, and have HTMLDOC export all of these pages into a single PDF file.
For example, if the wiki page starts with a special string (example: HTMLDOC_EXTRACT), then it must extract all pages listed below it.
Example:

HTMLDOC_EXTRACT

  * [[tips]]
    * [[tips:pdfexport]]
    * [[tips:browserlanguagedetection]]

This way you can create pages from which you can extract a PDF file based on multiple wiki pages.

Check the HTMLDOC_recursive_variant; it should support the requested feature.

Config problem with HTMLDOC variant

If you have added all the $conf['abc'] lines to the dokuwiki.php file and you get an 'Undefined Settings' section on your configuration page, like:

$conf['browserwidth']   No setting metadata.
$conf['customtoc']      No setting metadata.

you have to declare all the values in your config.metadata.php.
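A minimal sketch of such declarations (the setting classes 'numeric' and 'string' are standard DokuWiki ones; adjust them to your actual settings):

```php
<?php
// Sketch: metadata entries for the custom settings, to be placed in
// lib/plugins/config/settings/config.metadata.php so the config
// manager no longer lists them as undefined.
$meta['browserwidth'] = array('numeric');
$meta['customtoc']    = array('string');
```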

Changes to the TOC

Some recent changes in the core will break all the TOC-related code above, because the HTML for the TOC has been rewritten. The changes will be part of DokuWiki from the next release on (autumn 2012).

1)
On Debian: these are included in the packages html2ps and gs-common. All other needed packages (gs, for example) are installed automatically by apt-get.