docsearch: jodconverter and OpenOffice.org
I would like to share some conversion settings that worked for me.
I am using jodconverter together with OpenOffice.org in headless mode and the following settings:
- converter.php
doc <pathtojava>java -jar <pathtojodconverter>jodconverter-cli-2.2.2.jar %in% %out%
docx <pathtojava>java -jar <pathtojodconverter>jodconverter-cli-2.2.2.jar %in% %out%
odt <pathtojava>java -jar <pathtojodconverter>jodconverter-cli-2.2.2.jar %in% %out%
The Calc formats ods, xls and xlsx can be handled by a script that first converts them to .csv with jodconverter and then renames the result to .txt. Unfortunately, only the first sheet gets converted when the output is csv. With PDF conversion, all sheets including their names get converted (tested only for ods).
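The csv-then-rename step could look roughly like the sketch below. It is only an illustration, not a tested part of my setup: the function name `sheet2txt` and the `CONVERT_CMD` variable are made up for this example, and the jar path is an assumption you need to adapt. It relies on jodconverter choosing the output format from the file extension of the output file.

```shell
#!/bin/bash
# Sketch: convert a spreadsheet to CSV with jodconverter, then store it as .txt.
# Assumption: the jar path below matches your installation (hypothetical default).
CONVERT_CMD=(java -jar /opt/jodconverter/lib/jodconverter-cli-2.2.2.jar)

# sheet2txt <inputfile> <outputfile>   (hypothetical helper name)
sheet2txt() {
    local in=$1 out=$2
    local tmpdir csv
    tmpdir=$(mktemp -d) || return 1
    csv="$tmpdir/sheet.csv"
    # jodconverter derives the target format from the .csv extension
    if "${CONVERT_CMD[@]}" "$in" "$csv" && [ -f "$csv" ]; then
        # "rename" the CSV to the requested .txt output file
        mv "$csv" "$out"
        rm -rf "$tmpdir"
        return 0
    fi
    # conversion failed: clean up and report the error to the caller
    rm -rf "$tmpdir"
    return 1
}
```

In converter.php this would be called from a small wrapper script that runs `sheet2txt "$1" "$2"`. Keep in mind that only the first sheet ends up in the output.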
Unfortunately, jodconverter does not convert ppt or pptx directly to txt. It would be possible to convert them to PDF first and run the pdftotext converter afterward, but I don't like the overhead of such a chained solution. Are there any free command line tools out there to convert the mentioned formats on a Linux machine?
HINT: When using OpenOffice.org in headless mode, make sure you have enough memory. Otherwise it can crash, and the indexing of all following documents will fail → jodconverter complains that it cannot connect to the OpenOffice.org server.
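To notice a crashed OpenOffice.org before a whole indexing run fails, one could test whether anything still accepts connections on the port jodconverter talks to. The following is a minimal sketch and not part of my setup: `office_is_up` is a made-up name, and the host and port (127.0.0.1, 8100) are assumptions that must match how you started OpenOffice.org. It uses the `/dev/tcp` redirection built into bash.

```shell
#!/bin/bash
# Sketch: check that something accepts TCP connections on the OpenOffice.org port.
# Assumptions: host 127.0.0.1 and port 8100; adjust both to your setup.
OOO_HOST=127.0.0.1
OOO_PORT=8100

# office_is_up [host] [port]   (hypothetical helper name)
office_is_up() {
    local host=${1:-$OOO_HOST} port=${2:-$OOO_PORT}
    # bash opens a TCP connection when redirecting to /dev/tcp/<host>/<port>;
    # the subshell closes the descriptor again as soon as it exits
    ( exec 3<>"/dev/tcp/$host/$port" ) 2>/dev/null
}
```

A cron job or the conversion script itself could then run something like `office_is_up || echo "OpenOffice.org is not reachable" >&2` and restart the server if needed. The check only shows that the port is open, not that a conversion will actually succeed.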
Using jodconverter OpenOffice.org and a script
- converter.php
ppt <path to office2txt.sh>/office2txt.sh %in% %out%
pptx <path to office2txt.sh>/office2txt.sh %in% %out%
odp <path to office2txt.sh>/office2txt.sh %in% %out%
xls <path to office2txt.sh>/office2txt.sh %in% %out%
xlsx <path to office2txt.sh>/office2txt.sh %in% %out%
ods <path to office2txt.sh>/office2txt.sh %in% %out%
Here is the bash script I am using to do a chained conversion, because jodconverter cannot convert these formats directly to txt files: first to PDF and then to txt. Comments welcome since I am no bash guru…
- office2txt.sh
#!/bin/bash
# Converter script to convert almost everything OpenOffice can read to txt,
# using jodconverter and the pdftotext tool.
# Because jodconverter cannot convert file formats like ppt, pptx, xls, ods, xlsx to txt directly,
# a conversion to PDF is performed first using jodconverter. The second step is a conversion from
# PDF to txt using the pdftotext command line tool.
# usage: office2txt.sh <inputfile> <outputfile>
# <inputfile> is an arbitrary file OpenOffice can read (with correct file extension!)
# <outputfile> is the filename the result should go to (txt as file extension)
#
# adapt the settings below to your own needs

echo "Input: $1"

# jodconverter jar
JODCONVERTER_CMD=/opt/jodconverter/lib/jodconverter-cli-2.2.2.jar
# pdftotext binary (find out your path using 'which pdftotext')
PDF2TXT_CMD=/usr/bin/pdftotext
# your java cmd
JAVA_CMD=/usr/bin/java
# temporary folder for storing the PDF (path without trailing /; you need write access here!)
TMP_FOLDER=/tmp/pdftmp

# make sure the temporary folder exists
mkdir -p "$TMP_FOLDER"

# extract input name
input_fullfile=$1
input_filename_w_ext=$(basename "$input_fullfile")
input_extension=${input_filename_w_ext##*.}
input_filename_wo_ext=${input_filename_w_ext%.*}

# first conversion to PDF (quoted so file names with spaces work):
tmpfile="$TMP_FOLDER/$input_filename_wo_ext.pdf"
"$JAVA_CMD" -jar "$JODCONVERTER_CMD" "$input_fullfile" "$tmpfile"

# second conversion to txt:
"$PDF2TXT_CMD" "$tmpfile" "$2"

# remove tmp file
rm -f "$tmpfile"