====== How to convert docs to DokuWiki ======
I was googling for conversion tools and, fortunately, came across this article:
http://www.linux.com/articles/61713
====== Main goal : Magic conversion in bureaucratic environment ======
*.doc ---> *.html ---> *.txt ((wiki syntax))
Here is the overall scheme used to do this:
==== Step 0 | Preparing the environment ====
=== Dependencies : ===
* Linux !
* OpenOffice.org: http://www.openoffice.org/
* Java and JODConverter by: http://www.artofsolving.com
* Perl and WikiConverter module: http://search.cpan.org/src/DIBERRI/HTML-WikiConverter-0.61/
* Apache, PHP, and a DokuWiki out of the box ... or MediaWiki..
* Optional, the extension FCKW for DokuWiki: [[plugin:fckw]]
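Before going further, it may be worth checking that the command-line tools from the list above are installed. This is only a minimal sketch and not part of the original scripts: the function name ''missing_tools'' is made up for illustration, and it assumes the tools are reachable through the PATH under the names ''soffice'', ''java'', ''perl'' and ''html2wiki''.

```shell
# Hypothetical helper (not part of oocwiki.sh): print every tool
# from the argument list that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Tool names taken from the dependency list (assumed to be on the PATH).
missing=$(missing_tools soffice java perl html2wiki)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All tools found"
fi
```

If anything is reported as missing, install it (or fix the PATH) before running the conversion.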
=== Code needed ===
Three files are needed:
- The main bash script : oocwiki.sh [[doc_to_wiki_syntax#oocwiki.sh|The code]].
- The cleaning bash script : cleanfolder.sh [[doc_to_wiki_syntax#cleanfolder.sh |The code]].
- The renaming / auto loop conversion Perl script : oocwiki.pl [[doc_to_wiki_syntax#oocwiki.pl |The code]].
Copy the code and save each file under its name in a single folder on your computer.
=== Folders : ===
Create a folder holding your batch of MS Word files:
MS Word environment:
ENWOLRD=/home/massou/Documents/oldies/
Then set, in the bash script, the parameters for the other folders and files we need:
Temp folder:
TMPOOCWIKI=/tmp/oocwiki/
JODConverter jar:
JODCON=/home/massou/Documents/perl/jodconverter-2.2.1/lib/jodconverter-cli-2.2.1.jar
DokuWiki transfer folders:
OUTWIKI=/srv/www/htdocs/dokuwiki/data/pages/outdoc/
OUTMEDIA=/srv/www/htdocs/dokuwiki/data/media/outdoc/
Then use this bash script:
==oocwiki.sh==
#!/bin/bash
# script oocwiki.sh
#
# Usage (as root): bash oocwiki.sh
# This script converts a folder of MS Word files to DokuWiki pages (.doc -> .html -> .txt).
# Change the values of the variables to make the script work for you:
ENWOLRD=/home/massou/Documents/oldies/
TMPOOCWIKI=/tmp/oocwiki/
JODCON=/home/massou/Documents/perl/jodconverter-2.2.1/lib/jodconverter-cli-2.2.1.jar
OUTWIKI=/srv/www/htdocs/dokuwiki/data/pages/outdoc/
OUTMEDIA=/srv/www/htdocs/dokuwiki/data/media/outdoc/
if [ $(whoami) != 'root' ]; then
echo "Must be root to run $0"
exit 1;
fi
# if [ -z $1 ]; then
# echo "Usage: $0 "
# exit 1
# fi
parameters=("$ENWOLRD" "$TMPOOCWIKI" "$OUTWIKI" "$OUTMEDIA")
## Are the parameters OK?
for i in "${parameters[@]}"; do
if [ ! -e "${i}" ]; then
echo "${i} does not exist"
mkdir -p "${i}"
echo "${i} created"
elif [ -f "${i}" ]; then
echo "${i} is a file, not a folder"
elif [ -d "${i}" ]; then
echo "${i} seems ready"
fi
done
if [ ! -e "$JODCON" ]; then
echo "$JODCON does not exist"
exit 1;
elif [ -f "$JODCON" ]; then
echo "$JODCON is ready"
fi
# Start OpenOffice.org as a headless service if it is not already running
pgrep soffice
retval=$?
if [ "$retval" = 1 ]
then
echo "soffice does not seem to be running..."
soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard &
fi
### Cleaning and copy
parameters=("$TMPOOCWIKI" "$OUTWIKI" "$OUTMEDIA")
## Start from empty working and output folders
for i in "${parameters[@]}"; do
if [ -e "${i}" ]; then
echo "${i} exists, emptying it"
rm -R "${i}"
mkdir "${i}"
echo "${i} recreated"
fi
done
cp -R "$ENWOLRD"/* "$TMPOOCWIKI"
################### Step1 Some cleaning ##################
sh ./cleanfolder.sh $TMPOOCWIKI
######################### Step 2-3 Time of perl #################
perl oocwiki.pl $TMPOOCWIKI $JODCON
######################### Step 4 Copy of the files #################
cp -R $TMPOOCWIKI/* $OUTWIKI
cp -R $TMPOOCWIKI/* $OUTMEDIA
########### Step 5 time for ACL #########
parameters=($OUTWIKI $OUTMEDIA)
## is parameters ok ?
for i in ${parameters[@]}; do
chown -R wwwrun ${i}
chgrp -R www ${i}
chmod -R 775 ${i}
done
==== Step 1 | cleaning the Ms Word environment :====
/////*.doc
A bash (or Perl) script renames folders, subfolders and files from Windows-style names to simpler Unix-like names: lowercase, with underscores instead of spaces.
==cleanfolder.sh==
#!/bin/bash
# file cleanfolder.sh
# Convert filenames to lowercase
# and replace characters recursively
#####################################
if [ -z "$1" ]; then echo "Give target directory"; exit 1; fi
find "$1" -depth -name '*' | while IFS= read -r file ; do
directory=$(dirname "$file")
oldfilename=$(basename "$file")
newfilename=$(echo "$oldfilename" | tr 'A-Z' 'a-z' | tr ' ' '_' | sed 's/_-_/-/g')
if [ "$oldfilename" != "$newfilename" ]; then
mv -i "$directory/$oldfilename" "$directory/$newfilename"
echo "$directory/$oldfilename ---> $directory/$newfilename"
#echo "$directory"
#echo "$oldfilename"
#echo "$newfilename"
#echo
fi
done
exit 0
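To see what the renaming rule does to a single name, the same ''tr'' and ''sed'' pipeline can be wrapped in a small function and tried on a sample string. This is only an illustration: the function name ''normalize_name'' and the sample file name are made up, and nothing on disk is renamed.

```shell
# Same pipeline as in cleanfolder.sh, wrapped in a function.
# The name normalize_name is hypothetical, chosen for this example.
normalize_name() {
    # lowercase, spaces to underscores, then "_-_" collapsed to "-"
    printf '%s\n' "$1" | tr 'A-Z' 'a-z' | tr ' ' '_' | sed 's/_-_/-/g'
}

# Sample name (invented): prints rapport_annuel-v2.doc
normalize_name 'Rapport Annuel - V2.DOC'
```

Note that accented characters are left untouched at this stage; oocwiki.pl handles them in its second renaming pass.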
==== Step 2 : ====
lower_case/without_blank_space.doc ---> soffice as a service + JODConverter ---> *.html
==oocwiki.pl==
#!/usr/bin/perl -w
$time = localtime;
print "The time is now $time\n";
my $TMPOOCWIKI=$ARGV[0]."\n";
my $JODCON=$ARGV[1]."\n";
print $TMPOOCWIKI."\n";
print $JODCON."\n";
$chemin = $TMPOOCWIKI;
$jod = $JODCON;
chomp($chemin);
chomp($jod);
use File::Basename;
use File::Find;
find(\&Wanted, $chemin);
sub Wanted
{
if ($File::Find::name =~ m/^\Q$chemin\E/) {
$fullname = $File::Find::name . "\n";
($name,$path,$suffix) = fileparse($fullname,qr{\..*});
$suffix . "\n";
if ($suffix eq '.doc'){
# if ($suffix = "\.doc") {
$name = fileparse($fullname);
$basename = basename($fullname);
$dir = dirname($fullname);
$base2=lc($name);
$base2 =~ tr/ /_/;
$base2 =~ tr/ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ/aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn/;
#Step1 renaming, again
$dir =~ s/$/\//;
$newname = $dir.$base2;
# $newname =~ s/$/\.doc/;
print $fullname;
print $newname;
print $fullname;
print $newname;
# $fullname =~ s/ /\\ /;
# $newname =~ s/ /\\ /;
chomp($fullname);
chomp($newname);
# # print $newname;
rename("$fullname", "$newname") or
warn "Couldn't rename $fullname to $newname: $!\n";
#Prepare newname for conversion
$newname2 = $newname;
$newname3 = $newname;
$newname2 =~ s/\.doc$/\.html/ ;
$newname3 =~ s/\.doc$/\.txt/ ;
# print "sortie-----$newname2\n";
# Subroutine to execute the command step 2 and 3
my $res="";
my $cmd="java -jar $jod $newname $newname2|";
my $cmd2="html2wiki --dialect DokuWiki $newname2 > $newname3|";
open(EXEC,"$cmd");
while($res=<EXEC>){
chomp($res);
print "$res \n";
}
close(EXEC);
open(EXEC,"$cmd2");
while($res=<EXEC>){
chomp($res);
print "$res \n";
}
close(EXEC);
}
}
}
==== Step3 : ====
*.html ---> HtmlWikiConverter ---> *.txt
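Steps 2 and 3 come down to two commands per Word file. The sketch below prints them for one file without running anything, which can help when checking paths before a real run. The function name ''conversion_commands'' and the sample path are made up for illustration; the two command lines follow the ones used elsewhere on this page.

```shell
# Hypothetical dry-run helper: print the two conversion commands
# for one .doc file without executing them.
#   $1 = path to the .doc file, $2 = path to the JODConverter jar
conversion_commands() {
    local doc=$1
    local jar=$2
    local base=${doc%.doc}   # strip the .doc suffix
    echo "java -jar $jar $doc $base.html"
    echo "html2wiki --dialect DokuWiki $base.html > $base.txt"
}

# Sample path (invented) under the temp folder used by oocwiki.sh
conversion_commands /tmp/oocwiki/report.doc jodconverter-cli-2.2.1.jar
```

The first command needs soffice listening on port 8100, as started by oocwiki.sh.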
==== Step4 : ====
Finally, we simply copy all the files to both the media and the pages folders; that is enough for now.
Still to do: some Perl scripting to rewrite the media URLs so that they point to the right media location, and to dispatch the media and txt files to the right places on the server.
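One possible way to dispatch the files is by extension. The sketch below is an assumption, not what oocwiki.sh currently does: the function name ''dispatch_target'', the list of image extensions and the sample file names are all made up for illustration. It only prints a decision and copies nothing.

```shell
# Hypothetical rule: decide where a converted file should go.
# Prints "pages" for wiki text, "media" for images, "skip" otherwise.
# Lowercase patterns are enough because Step 1 lowercases every name.
dispatch_target() {
    case "$1" in
        *.txt)                      echo pages ;;
        *.png|*.jpg|*.jpeg|*.gif)   echo media ;;
        *)                          echo skip ;;
    esac
}

# Sample names (invented) showing each outcome
for f in report.txt report_img1.png report.doc report.html; do
    echo "$f -> $(dispatch_target "$f")"
done
```

A real version would then copy each file to OUTWIKI or OUTMEDIA according to the printed target.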
==== Step5 : ====
Fix permissions.
====== Command lines in use ======
First, you need OpenOffice.org on a Linux box.
Open a terminal and run this:
soffice -headless -accept="socket,port=8100;urp;"
http://www.artofsolving.com/node/10
(Don't forget to use the cli jar.)
java -jar jodconverter-cli-2.2.1.jar A.doc A.pdf
java -jar jodconverter-cli-2.2.1.jar A.doc A.html
http://search.cpan.org/src/DIBERRI/HTML-WikiConverter-0.61/README
massou@linux-hj6y:~/Documents/momas/jodconverter-2.2.1/lib> html2wiki --dialect DokuWiki A.html > output.mw