tips:extract

The following snippet extracts all external links from the wiki together with their surrounding context. The context runs from the nearest preceding boundary (beginning of the line, previous link, or beginning of the sentence) to the nearest following boundary (end of the line, next link, or end of the sentence).

<?php
 
// where your dokuwiki data resides
$dir = '/var/lib/dokuwiki/data/pages/';
 
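// quote the pattern so the shell does not expand it before find sees it;
// array_filter drops the empty entry left by find's trailing newline
$pages = array_filter(explode("\n", (string) `find "$dir" -iname '*.txt'`));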
 
foreach ($pages as $page)
{
  $contents = file_get_contents($page);
 
  //bracketed links: [[http://www.bla...|link text]]
  $regexBracketed = '\[\[(?P<bracketed>https?:\/\/[^\]]+)\]\]';
  // free-standing links: www.google.com (no http:// necessary)
  $regexFree = '(?P<free>(?:https?:\/\/)?(?:www|ftp)\.[^\s]+)';
  // punctuation marks to find the end or beginning of a sentence
  $punctuation = '\.|\!|\?';
  // get the rest of the sentence, only if there is not another link in the same sentence
  $suffix = '(?=.*?((www|ftp)\.|\[\[https?\:\/\/).*?($|'.$punctuation.'))|(.*?(?:$|'.$punctuation.'))';
 
  $regex = '/(?:^|'.$punctuation.')?(?P<prefix>.*?)(?:'.$regexBracketed.'|'.$regexFree.')(?P<suffix>'.$suffix.')/m';
 
  preg_match_all($regex, $contents, $matches, PREG_SET_ORDER);
 
  foreach ($matches as $match)
  {
    // $match[0] contains the entire match as described above
 
    // see which kind of link we discovered
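    // with PREG_SET_ORDER each named group is a plain string (empty if it did not take part in the match)
    $hit = (!empty($match['bracketed']) ? $match['bracketed'] : $match['free']);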
 
    // split into link|linktext if necessary
    if (strpos($hit, '|') === false)
    {
      $url = $hit;
      $title = $hit;
    }
    else
    {
      list($url, $title) = explode('|', $hit, 2);
    }
  }
}
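
The loop above computes $url and $title for every hit but leaves it open what to do with them. One possible way to go on, shown here only as a minimal sketch: move the link|linktext split into a small function and emit one tab-separated line per link. The names splitLink and formatRow, the tab-separated output and the example.com URL are assumptions made for this illustration and are not part of the original snippet.

```php
<?php

// Hypothetical helper: split "url|title" into its two parts.
// When no title is given, the URL doubles as the title.
function splitLink(string $hit): array
{
    if (strpos($hit, '|') === false) {
        $hit = trim($hit);
        return [$hit, $hit];
    }
    list($url, $title) = explode('|', $hit, 2);
    return [trim($url), trim($title)];
}

// Hypothetical helper: format one result as a tab-separated line.
function formatRow(string $page, string $url, string $title): string
{
    return implode("\t", [$page, $url, $title]);
}

// Example with made-up values; inside the loop above this would be
// formatRow($page, ...splitLink($hit)).
echo formatRow('start.txt', ...splitLink('http://www.example.com|Example site')), "\n";
```

Redirecting the output to a file gives a list that opens in any spreadsheet program. If the context is wanted as well, $match[0] could be added as a fourth column.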

For further questions you can contact me at g [dot] sorst [at] clickforknowledge [dot] com.