
Locate Orphan Media

Sooner or later you will have uploaded a great many files to your DokuWiki installation, or you will be doing some maintenance and want to save space. Either way, you will reach the point where you ask yourself: “Which media files are actually being used in my wiki installation?”

There are ways to clean up media automatically: clean_media_directory shows a Perl snippet that gets rid of unlinked files. Here, however, we are only interested in generating a list of unlinked files for later use.

What follows is simply a combination of Unix shell utilities put together during the night. It can easily be improved upon and integrated elsewhere, e.g. in a cron job.

Adapted from this forum thread.


The Process

First, some requirements:

  • Shell access to the server hosting your DokuWiki installation.
  • Access to some basic utilities such as find, cut, tr, xargs, grep and sed. The grep -P option used below needs GNU grep built with PCRE support.
  • Access to your DokuWiki directory; here it will be called $DOKU_DIR.

First, let's list all the uploaded media by going through your DokuWiki's media directory. The following snippet creates a temporary file in /tmp which lists all your media files in DokuWiki syntax (i.e. :path:to:media_file):

list_media_files.sh
[user@host] $ cd "$DOKU_DIR/data/media"
[user@host] $ find . -not -type d | cut -c 2- | tr '/' ':' > /tmp/mediafiles.txt

Explanation: We first find every file in the media directory that is not a (d)irectory, which gives us all the media files. find prints each one as ./path/to/media_file, and cut -c 2- drops the leading period, so every path starts with a slash at the base media directory. We then transform all slashes into colons to match DokuWiki link syntax, which turns that leading slash into the leading colon of an absolute link. The result is stored in /tmp/mediafiles.txt.
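
To watch the transformation happen, here is a small sketch that runs the same pipeline against a throwaway directory. The directory and the file names (wiki/logo.png, photo.jpg) are made up for the demonstration, and the final sort is only there to make the output order predictable.

```shell
# Build a throwaway media tree; the file names are hypothetical.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/wiki"
touch "$demo_dir/wiki/logo.png" "$demo_dir/photo.jpg"

# Same pipeline as in the snippet above, run inside the demo tree.
# find prints ./wiki/logo.png, cut drops the ".", tr turns "/" into ":".
media_list=$(cd "$demo_dir" && find . -not -type d | cut -c 2- | tr '/' ':' | sort)
printf '%s\n' "$media_list"
# prints:
# :photo.jpg
# :wiki:logo.png

rm -rf "$demo_dir"
```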

Now, check all text files in the pages directory and list all text patterns of the form {{:mediafile[...]}} (the leading colon is there to dismiss external links). This snippet creates a file in /tmp listing all the direct invocations of media files.

list_media_invocations.sh
[user@host] $ cd "$DOKU_DIR/data/pages"
[user@host] $ find . -type f | xargs grep -P -oh "\{\{[.]?\:.+?\..{3}(\|.+?)?\}\}" \
| sed -e 's/{{\./{{/' -e 's/|[^}]*//g' -e 's/[{{|}}]//g' > /tmp/mediareferences.txt

This creates /tmp/mediareferences.txt, a text file containing all the media file invocations, stripped of their markup and any custom title. It requires that media references begin with a colon (or a period), as if they were absolute links, but it should work for most references to media uploaded via the Upload Manager or linked to via the Link Wizard.

Explanation: We retrieve a list of all the pages in the wiki installation and search each page for media invocations. These are defined as text patterns of the form {{[.]:path:to:media_file.gif|Some text}}, where the period is optional (and indicates the link is relative to the current directory). Extensions are assumed to be three characters long (e.g. “gif”, “zip”). The leading {{, the optional period, and any text starting at | or } are removed from the pattern, leaving only the media link. Finally, all those links are stored in a file.
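
The sketch below feeds a few made-up wiki lines through a pipeline modeled on the one above, to show what is kept and what is dropped. The sample text is hypothetical. One deliberate difference: the title group is written lazily as (\|.+?)? so that two references on the same line are matched separately rather than as one long match. As above, this needs GNU grep with -P support.

```shell
# Hypothetical page content: a titled link, a relative link, an external
# link and a resized image.
sample='Logo: {{:wiki:logo.png|Our logo}} and {{.:local.gif}}
External: {{http://example.com/pic.jpg}}
Resized: {{:wiki:photo.jpg?200}}'

media_refs=$(printf '%s\n' "$sample" \
  | grep -P -o '\{\{[.]?\:.+?\..{3}(\|.+?)?\}\}' \
  | sed -e 's/{{\./{{/' -e 's/|[^}]*//g' -e 's/[{{|}}]//g')
printf '%s\n' "$media_refs"
# prints:
# :wiki:logo.png
# :local.gif
# The external link is skipped because no colon follows "{{".
# The resized image is missed because "?200" sits between the
# three-character extension and the closing "}}".
```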

A/N: Note that the snippet above will not necessarily catch all media links. In particular, it will fail to catch some relative links; this will be improved upon.

Now the only thing remaining is to find all files listed in /tmp/mediafiles.txt that do not appear in /tmp/mediareferences.txt:

[user@host] $ grep -v -F -f /tmp/mediareferences.txt /tmp/mediafiles.txt > orphanedmedia.txt

Voilà. orphanedmedia.txt contains the wikipaths of all the media files that are never invoked, in DokuWiki link format (:path:to:media_file.gif).
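
One thing to keep in mind is that grep -F matches substrings, so a reference such as :logo.png also counts as a match for :ns:logo.png. A stricter variant, sketched here on made-up sample lists, adds -x so that only whole lines count. The reference list is passed as the pattern file and the media list as the input; the file contents below are purely illustrative.

```shell
# Hypothetical stand-ins for /tmp/mediafiles.txt and /tmp/mediareferences.txt.
work_dir=$(mktemp -d)
printf '%s\n' ':wiki:logo.png' ':ns:logo.png' ':old:unused.zip' \
  > "$work_dir/mediafiles.txt"
printf '%s\n' ':logo.png' ':wiki:logo.png' ':wiki:logo.png' \
  > "$work_dir/mediareferences.txt"

# -f reads the references as patterns, -F treats them as fixed strings,
# -x requires a whole-line match, -v keeps the media files with no match.
orphans=$(grep -v -x -F -f "$work_dir/mediareferences.txt" "$work_dir/mediafiles.txt")
printf '%s\n' "$orphans"
# prints:
# :ns:logo.png
# :old:unused.zip

rm -rf "$work_dir"
```

Without -x, :ns:logo.png would be treated as used here, because the reference :logo.png is contained in it.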

Considerations

Not 100% safe (see above), but this should locate most orphan files if media references are always inserted through the media manager. Also note that I'm no Bash master; I just worked with some tools until it worked.

Both scripts could be further adapted to make sure they catch all relative media links. I'll be studying how to do that.
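
For unattended use, such as the cron job mentioned earlier, the three steps can be wrapped in a single function. This is only a sketch under a few assumptions: list_orphan_media is a made-up name, the function takes the DokuWiki directory as its first argument, it uses a private temporary directory instead of fixed names in /tmp, and it uses the lazy title group and whole-line matching (-x) rather than the exact commands shown above. It has the same blind spots regarding relative links.

```shell
# list_orphan_media DIR
# Hypothetical helper: prints media files under DIR/data/media that no
# page under DIR/data/pages refers to, one :path:to:file per line.
list_orphan_media() {
    lom_doku_dir=$1
    lom_tmp=$(mktemp -d) || return 1

    # Step 1: every media file, written as :path:to:media_file
    (cd "$lom_doku_dir/data/media" && find . -not -type d | cut -c 2- | tr '/' ':') \
        > "$lom_tmp/mediafiles.txt"

    # Step 2: every {{:media}} or {{.:media}} reference found in the pages,
    # stripped of braces, leading period and title (needs GNU grep -P)
    (cd "$lom_doku_dir/data/pages" && find . -type f \
        -exec grep -P -oh '\{\{[.]?\:.+?\..{3}(\|.+?)?\}\}' {} +) \
        | sed -e 's/{{\./{{/' -e 's/|[^}]*//g' -e 's/[{{|}}]//g' \
        > "$lom_tmp/mediareferences.txt"

    # Step 3: media files that match no reference exactly
    grep -v -x -F -f "$lom_tmp/mediareferences.txt" "$lom_tmp/mediafiles.txt"

    rm -rf "$lom_tmp"
}
```

A script containing this function could then end with a call such as list_orphan_media "$DOKU_DIR" > orphanedmedia.txt.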


See Also

  • orphanmedia - Find orphan and missing media files.
  • unusedmedias - a plugin which lists all orphaned media files which are not used in wiki pages and gives option to delete them.
  • clean_media_directory - a Perl script which manages orphaned media files.
  • Plugin Request which originated this example.

tips/locateorphanmedia.txt · Last modified: 2015-03-14 22:57 by 87.158.123.50

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International