Microsoft KB Archive/A guide to mass-adding files

From BetaArchive Wiki

This is a short guide on how to import KB articles located in MSDN help files into this wiki. Note that most of this information is aimed at Andy and anyone interested in how the process works, rather than at the average end user.

  1. Look for the .chm file that contains the KB files you want.
  2. Use 7zip to extract all the files inside it into a directory of your choice.
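  3. Navigate to one of the extracted directories. You should see it full of files named Q[number].htm.
  4. Use pandoc to convert all the files in that directory to MediaWiki markup, writing the output to a different folder. For example, to convert them into a folder named convert, create it first with mkdir convert, then run (on Linux/WSL) for f in *.htm; do pandoc "$f" -f html -t mediawiki -s -o "convert/${f%}"; done. The converted files keep the .htm extension.
  5. Strip the leading Q from each file name with rename 's/.{1}(.*)/$1/' * (this is the Perl rename). Note that the expression removes the first character of every name it is given, so run it only in a folder that contains nothing but the Q-prefixed files.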
  6. On a local installation of MediaWiki, import all the files with sudo php /var/www/html/mw/maintenance/importTextFiles.php -u X010 -s "stage1" --prefix "Microsoft KB Archive/" *.htm, replacing X010 with your username. For small batches, using Special:Import on the wiki could work as well.
  7. Export it: run sudo php dumpBackup.php --current --output gzip:out to export everything in the wiki as a gzip-compressed XML dump. Decompress it and remove any unwanted pages from the output.
  8. Use Special:Import to import everything into this wiki.
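
Steps 4 and 5 can also be combined into a single pass. The following is only a sketch under some assumptions: the function names strip_q and convert_kb_dir and the example paths are hypothetical, pandoc is assumed to be on the PATH, and unlike the rename expression in step 5 it removes only a leading Q or q rather than whatever the first character happens to be.

```shell
#!/usr/bin/env bash
# Sketch: convert Q*.htm files to MediaWiki markup and drop the leading Q
# in one pass. Function names and paths are illustrative, not part of any tool.

# strip_q NAME: print NAME without a single leading "Q" or "q".
strip_q() {
    local name
    name=$1
    printf '%s\n' "${name#[Qq]}"
}

# convert_kb_dir SRC OUT: run pandoc on every .htm file in SRC and write the
# result into OUT under the Q-less file name (the .htm extension is kept).
convert_kb_dir() {
    local src out f base
    src=$1
    out=$2
    mkdir -p "$out"
    for f in "$src"/*.htm; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        base=$(basename "$f")
        pandoc "$f" -f html -t mediawiki -s -o "$out/$(strip_q "$base")"
    done
}

# Example call (hypothetical paths):
# convert_kb_dir ./kb ./kb/convert
```

Because the loop handles one file at a time, it does not pass a long file list to a single command, at the cost of starting pandoc once per file (which the loop in step 4 does as well).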

Special notes:

  • If the import times out, you'll need to split it. Testing with this wiki shows that about 350 pages can be imported at a time without any issues. In that case, instead of using dumpBackup.php, use Special:AllPages, copy the page list into Notepad, and then paste it into Special:Export.
  • Alternatively, install the Html2Wiki extension. While it's very powerful in that one can simply zip all the files and then use the extension to upload them all at once, you'll also need to drastically increase the timeout limits. In local testing, about 500 pages could be imported at a time this way (hence N files can be split into roughly N/500 zip files).
  • Currently, only moderators and higher can import pages. Should you require this, please contact X010 or Andy or mrpijey, and we may grant the import permission to you. The same applies if the size is too large, in which case we'll import it directly on your behalf.
  • Later KB files contain links that pandoc converts to the form [[qnumber|number]]. To convert these to relative links of the form [[../number|number]], we need to apply a pair of sed commands: sed -i -- 's/|/|/g' * and sed -i -r 's/\[\[q/\[\[..\//g' *. It's also possible to do this post-conversion, but that will be costlier and may lead to timeout issues.
  • Running these commands over a very large number of files (in Linux/WSL at least) fails with an "Argument list too long" error, because the expanded * exceeds the system's limit on command-line length. To fix this, wrap the commands in find: in place of the sed operations, use find . -maxdepth 1 -type f -name '*' -exec sed -i -- 's/|/|/g' {} + and find . -maxdepth 1 -type f -name '*' -exec sed -i -r 's/\[\[q/\[\[..\//g' {} +. In place of rename, use mmv q\* \#1. Do NOT wrap rename in a per-file loop such as for i in *; do rename 's/.{1}(.*)/$1/' "$i"; done; this is far too slow.
  • If you have local access to the server, the last two steps can be skipped. Also make sure to run the maintenance scripts rebuildall.php and runJobs.php at the end. This is how the 215000 KB articles were added to the wiki: the import took around 3 hours, with the maintenance scripts also taking a significant amount of time.
  • Special:ReplaceText can be used to fix remaining links. However, for some reason it will only check 250 pages at a time, so you'll need to go to /extensions/ReplaceText/src and change the limit in ReplaceTextSearch.php to something like 9990 (on this wiki the PHP limit is set to 10000). Also remember to run runJobs.php if possible, or otherwise raise the $wgJobRate to something like 10.
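
The 350-page batching described in the notes above can be scripted rather than done by hand in Notepad. This is a minimal sketch, assuming the page titles from Special:AllPages have already been saved one per line in a text file; the function name split_titles, the file name titles.txt, and the batch_ prefix are all hypothetical. It relies only on the standard split utility.

```shell
#!/usr/bin/env bash
# Sketch: cut a newline-separated list of page titles into chunks small
# enough to paste into Special:Export one at a time.

# split_titles FILE [SIZE] [PREFIX]
#   FILE    list of page titles, one per line
#   SIZE    titles per chunk (defaults to 350, the figure quoted above)
#   PREFIX  output name prefix; split appends aa, ab, ac, ...
split_titles() {
    local file size prefix
    file=$1
    size=${2:-350}
    prefix=${3:-batch_}
    split -l "$size" "$file" "$prefix"
}

# Example call (hypothetical file name):
# split_titles titles.txt 350 batch_
```

Each resulting file (batch_aa, batch_ab, and so on) can then be pasted into Special:Export and the exported XML imported separately. The chunk size is a parameter so that it can be lowered if imports still time out.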