TL;DR: In Joomla there’s often more than one way to reach the same page. If you’re moving off Joomla, this needs to be considered, preferably before moving any pages. This article discusses the cases that I found during my recent move to Hugo.
Recently I decided to move a Joomla site over to Hugo. I had search engine friendly URLs turned on and therefore assumed the mapping would be easy: whatever the path was in Joomla should be the same path in Hugo (insert CMS of choice), right?
Wrong, most definitely not. Joomla has many ways of accessing an article: by reference to the article component, via a module, or even by category or by menu. At the end of the day, the URL mapping is done using mod_rewrite in Apache, in a different way from what one might expect. Below are some examples:
Unfortunately, it’s often the case that we don’t look too closely at this until the time comes to move off Joomla, like my recent move to Hugo. It’s at this point that the full extent of the situation becomes apparent.
There really is no magic wand that’s going to fix this. It’s just a case of working out all the different combinations from the Apache access logs and webmaster tools, then trying to map over as many as possible. Some URLs with lots of parameters may be more trouble than they are worth, leaving the option of a good 404 handler rather than fixing every link.
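Once a list of old-to-new mappings exists, turning it into redirect rules can be scripted. The following is only a minimal sketch, assuming the new site is still served by Apache with mod_alias available; the function name make_redirects, the two-column input format and the example paths are hypothetical and not taken from the actual migration.

```shell
# Hypothetical helper: reads "old-path new-path" pairs on stdin (one pair
# per line, separated by whitespace) and prints one Apache mod_alias
# "Redirect 301" rule per pair. Blank lines and lines starting with "#"
# are skipped.
make_redirects() {
  awk 'NF >= 2 && $1 !~ /^#/ { printf "Redirect 301 %s %s\n", $1, $2 }'
}

# Example with a made-up mapping:
#   printf '%s\n' '/component/content/article/12-foo /posts/foo/' | make_redirects
```

Note that Redirect only matches the URL path, not the query string, so old URLs that differ only in their parameters cannot be told apart this way. Those are the ones that may be better left to the 404 handler.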
Below are two example Linux shell commands that may be useful. The first covers GET requests and the second POST. Both exclude robots.txt, PHP and PNG files, then produce a unique list of 404s with a count for each:
grep '" 404 ' access_log* | grep -v '\.png' | grep -v php | grep -v 'robots\.txt' | sed -En 's/.*"(GET[^"]*).*/\1/p' | sort | uniq -c
grep '" 404 ' access_log* | grep -v '\.png' | grep -v php | grep -v 'robots\.txt' | sed -En 's/.*"(POST[^"]*).*/\1/p' | sort | uniq -c
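To decide which missing URLs are worth mapping first, it can help to rank them by how often they were requested. The sketch below is one way to do that, assuming the logs are in the usual Apache common or combined format, where the request path is the 7th whitespace-separated field and the status code the 9th. The function name top_404_paths is made up for illustration.

```shell
# Hypothetical helper: reads access log lines on stdin and prints each
# request path that returned a 404, prefixed by its hit count, with the
# most requested path first. Assumes common/combined log format
# (field 7 = path, field 9 = status).
top_404_paths() {
  awk '$9 == 404 { print $7 }' | sort | uniq -c | sort -rn
}

# Possible usage:
#   cat access_log* | top_404_paths | head -20
```

Matching on the status field rather than grepping for the string 404 avoids counting successful requests that merely contain 404 somewhere else on the line, for example as the response size.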