What about download.microsoft.com, support.microsoft.com?

tristanleboss
Posts: 63
Joined: Wed Jan 11, 2017 12:37 pm

What about download.microsoft.com, support.microsoft.com?

Post by tristanleboss »

I just discovered that the Wayback Machine coverage for these domains is not really amazing.

I tried to download many files linked to the Windows Live ID system (Windows Live ID Client 1.0 SDK Alpha Refresh, Windows Live ID Delegated Authentication SDK, Windows Live ID Web Authentication SDK,...) and none of them had been saved in the Wayback Archive. The download page can be found but the actual download files are not available...

It seems that, overall, Microsoft's web content is being destroyed and is not well preserved. The same goes for the MSDN documentation and the KB articles (whose URLs changed many times over the life of the support site, making them hard to find in the Wayback Machine). For now, it seems only the old Windows Update files (hotfixv4.microsoft.com, download.windowsupdate.com) are still available, but for how long?

Do you know of any initiative to back up those sites?

Sept
Posts: 14
Joined: Sun Mar 05, 2017 10:13 pm

Re: What about download.microsoft.com, support.microsoft.com

Post by Sept »

You can try https://web-beta.archive.org/web/*/(DOMAIN HERE)/* if you haven't already. It lists everything archive.org has captured under that domain.

tristanleboss
Posts: 63
Joined: Wed Jan 11, 2017 12:37 pm

Re: What about download.microsoft.com, support.microsoft.com

Post by tristanleboss »

Thanks. Yes, I tried that and discovered that the capture of these domains is really limited... for hotfixv4.microsoft.com, there are 589 files :(

For download.microsoft.com, I tried all these files and only 2 were saved (scary, to say the least!):

http://download.microsoft.com/download/ ... rtsSDK.msi
http://download.microsoft.com/download/ ... itySDK.zip
http://download.microsoft.com/download/ ... entSDK.msi
http://download.microsoft.com/download/ ... th-1.0.msi
http://download.microsoft.com/download/ ... va-1.0.zip
http://download.microsoft.com/download/ ... 1.0.tar.gz
http://download.microsoft.com/download/ ... 1.0.tar.gz
http://download.microsoft.com/download/ ... 1.0.tar.gz
http://download.microsoft.com/download/ ... 1.0.tar.gz
http://download.microsoft.com/download/ ... cs-1.2.msi
http://download.microsoft.com/download/ ... va-1.2.zip
http://download.microsoft.com/download/ ... 1.2.tar.gz
http://download.microsoft.com/download/ ... 1.2.tar.gz
http://download.microsoft.com/download/ ... 1.2.tar.gz
http://download.microsoft.com/download/ ... vb-1.2.msi
http://download.microsoft.com/download/ ... 1.2.tar.gz
http://download.microsoft.com/download/ ... ebauth.msi
http://download.microsoft.com/download/ ... va-1.0.zip
http://download.microsoft.com/download/ ... 1.0.tar.gz
http://download.microsoft.com/download/ ... 1.0.tar.gz
http://download.microsoft.com/download/ ... 1.0.tar.gz
http://download.microsoft.com/download/ ... 1.0.tar.gz
http://download.microsoft.com/download/ ... th-1.1.msi
http://download.microsoft.com/download/ ... va-1.1.zip
http://download.microsoft.com/download/ ... 1.1.tar.gz
http://download.microsoft.com/download/ ... 1.1.tar.gz
http://download.microsoft.com/download/ ... 1.1.tar.gz
http://download.microsoft.com/download/ ... 1.1.tar.gz
http://download.microsoft.com/download/ ... cs-1.2.msi
http://download.microsoft.com/download/ ... va-1.2.zip
http://download.microsoft.com/download/ ... 1.2.tar.gz
http://download.microsoft.com/download/ ... 1.2.tar.gz
http://download.microsoft.com/download/ ... 1.2.tar.gz
http://download.microsoft.com/download/ ... 1.2.tar.gz
http://download.microsoft.com/download/ ... vb-1.2.msi
http://download.microsoft.com/download/ ... 008CTP.zip
http://download.microsoft.com/download/ ... 008CTP.zip

Apex
Posts: 7
Joined: Wed Sep 06, 2017 4:29 pm

Re: What about download.microsoft.com, support.microsoft.com

Post by Apex »

You can also use the Wayback CDX Server API to get a space-delimited list of captures and metadata. This has the advantage of allowing you to filter by MIME type, status code, uniqueness, and so on.

For example, to get a list of 1000 unique files from everything the IA has captured for download.microsoft.com:

http://web.archive.org/cdx/search/cdx
?url=download.microsoft.com/download/
&matchType=prefix
&collapse=digest
&filter=statuscode:200
&limit=1000

Once you've filtered that list down to what you need, you can use the metadata to build a list of URLs to pass to wget:

https://web.archive.org/web/<timestamp>/<url-minus-protocol>

This isn't much help if the Wayback Machine never captured the files in the first place, unfortunately.
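
A minimal Python sketch of the two steps above, assuming the CDX server's default column order (urlkey, timestamp, original, mimetype, statuscode, digest, length). The function names and the sample row are made up for illustration, and nothing here touches the network: it only builds the query URL and turns a returned row into a Wayback Machine URL.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

# Default column order of the CDX server's space-delimited output.
CDX_FIELDS = ("urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length")


def build_cdx_query(url_prefix, limit=1000):
    """Build the query URL shown above for a given URL prefix."""
    params = [
        ("url", url_prefix),
        ("matchType", "prefix"),
        ("collapse", "digest"),
        ("filter", "statuscode:200"),
        ("limit", str(limit)),
    ]
    # Keep "/" and ":" unescaped so the URL stays readable.
    return CDX_ENDPOINT + "?" + urlencode(params, safe="/:")


def parse_cdx_line(line):
    """Split one space-delimited CDX row into a dict keyed by field name."""
    return dict(zip(CDX_FIELDS, line.split()))


def wayback_url(capture):
    """Turn a parsed capture into a Wayback Machine URL."""
    # Drop the scheme, as in the <url-minus-protocol> template above.
    without_scheme = capture["original"].split("://", 1)[-1]
    return "https://web.archive.org/web/{}/{}".format(
        capture["timestamp"], without_scheme)


if __name__ == "__main__":
    # Hypothetical sample row, invented for illustration only.
    sample = ("com,example)/files/a.zip 20170101000000 "
              "http://example.com/files/a.zip application/zip 200 DIGEST 42")
    print(build_cdx_query("download.microsoft.com/download/"))
    print(wayback_url(parse_cdx_line(sample)))
```

The resulting URLs can be written one per line to a text file and handed to wget with its -i option.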

TuneableSumo876

Re: What about download.microsoft.com, support.microsoft.com

Post by TuneableSumo876 »

You can also use the wayback_machine_downloader Ruby gem. Downloading an entire website will take time so you may want to let it run overnight (and/or download to a spare external hard drive, based on the size of the website).



tristanleboss
Posts: 63
Joined: Wed Jan 11, 2017 12:37 pm

Re: What about download.microsoft.com, support.microsoft.com

Post by tristanleboss »

Yes, I tried all these methods. Unfortunately, if the file has not been saved, not much can be done.

It seems no one really cares about saving the files on those servers.

merlix
Posts: 27
Joined: Tue May 26, 2015 7:28 am

Re: What about download.microsoft.com, support.microsoft.com

Post by merlix »

The Wayback Machine USED to have a complete listing of download.microsoft.com; I remember exploring the thousands upon thousands of files. But when robots.txt was put in place, the complete list of files located there was effectively lost. I wish they had only respected it from that point forward, but they didn't. I just wish I'd had the foresight to capture the listing.

TuneableSumo876

Re: What about download.microsoft.com, support.microsoft.com

Post by TuneableSumo876 »

If the Wayback Machine didn't respect robots.txt someone would file a lawsuit against them, for not respecting robots.txt.

Therefore we have this issue.



Apex
Posts: 7
Joined: Wed Sep 06, 2017 4:29 pm

Re: What about download.microsoft.com, support.microsoft.com

Post by Apex »

You might try capturing that listing again. As far as I'm aware, the IA has never actually removed anything in response to changes in robots.txt. On the contrary, they have recently started to relax their observance. It's true that for a time many domains became unavailable because of robots.txt abuse, but I don't seem to get the dreaded "Page cannot be crawled or displayed due to robots.txt" anymore.

AlphaBeta
Donator
Posts: 2437
Joined: Sun Aug 12, 2012 4:33 pm
Location: Czechia

Re: What about download.microsoft.com, support.microsoft.com

Post by AlphaBeta »

TuneableSumo876 wrote: If the Wayback Machine didn't respect robots.txt someone would file a lawsuit against them, for not respecting robots.txt. Therefore we have this issue.

Not respecting robots.txt was not a criminal offense last time I checked.

3155ffGd
Posts: 391
Joined: Wed May 02, 2012 12:57 am

Re: What about download.microsoft.com, support.microsoft.com

Post by 3155ffGd »

I recently read a blog post about the old KB articles. Microsoft recently purged a lot of old KB articles going back to NT/2000/XP/95/98/ME and MS-DOS days. It's really a shame, because these KB articles contain information generally not found anywhere else. At least the Web Archive still works for those (but for how long?)

It was mentioned in that blog post that someone should form a collective to preserve old Knowledge Base articles so they don't get lost forever. It turns out Microsoft published KB articles on CD for a very long time, as part of Visual C++ or the MSDN Library set of CDs, or in the ancient days (1980s) as part of the Microsoft Programmer's Library. Some should also be found in the Microsoft FTP archive. The only work would be collecting them all, extracting them to a complete set and then archiving them.

DOS
Posts: 205
Joined: Sun Mar 16, 2014 6:56 am

Re: What about download.microsoft.com, support.microsoft.com

Post by DOS »

tristanleboss wrote: I tried to download many files linked to the Windows Live ID system (Windows Live ID Client 1.0 SDK Alpha Refresh, Windows Live ID Delegated Authentication SDK, Windows Live ID Web Authentication SDK,...) and none of them had been saved in the Wayback Archive. The download page can be found but the actual download files are not available...
I noticed Internet Archive has a 66GB dump of ftp.microsoft.com from 2015, I wonder if any of those things would be in there? I'm curious about what is in there but I don't know if I want to download 66GB!

3155ffGd wrote: I recently read a blog post about the old KB articles. Microsoft recently purged a lot of old KB articles going back to NT/2000/XP/95/98/ME and MS-DOS days. It's really a shame, because these KB articles contain information generally not found anywhere else. At least the Web Archive still works for those (but for how long?)
I imagine you can't find them all on the web, as some of them are Microsoft promoting OS/2, and I think they decided to get rid of them *hehe*

There have been a few blog posts on this topic recently since three different bloggers were involved:

https://virtuallyfun.com/2017/10/17/mic ... es-online/
http://www.os2museum.com/wp/ms-kb-articles/
https://www.pcjs.org/blog/2017/10/13/ - has a link to where some recovered KB articles are being hosted

3155ffGd wrote: The only work would be collecting them all, extracting them to a complete set and then archiving them.
It's not trivial, as MSDN CDs use various formats with proprietary extensions, and the Microsoft Programmer's Library's file format isn't documented, but some of this work is in progress.

3155ffGd
Posts: 391
Joined: Wed May 02, 2012 12:57 am

Re: What about download.microsoft.com, support.microsoft.com

Post by 3155ffGd »

DOS wrote:and the Microsoft Programmer's Library's file format isn't documented
You can probably extract that with HELPMAKE.EXE included with Microsoft C 5.1/6.0 (haven't tried it myself). Surprised no one figured it out.

I'm also working on a Knowledge Base Archive. I already downloaded the FTP archive and much to my disappointment, it doesn't just stop in 1999, it's also missing several older articles, especially those related to MS-DOS and other DOS applications (Word for DOS, Visual Basic 16-bit, etc.), but also a few Windows 95/NT articles with no apparent pattern behind it. It will be a lot of work restoring everything. It also doesn't contain any really useful downloads, if you were wondering about that, just a lot of old junk.

Someone estimated at some point the Knowledge Base had 200,000 articles. Right now I have 62,000 articles. Just so you know the scope of this.

DOS
Posts: 205
Joined: Sun Mar 16, 2014 6:56 am

Re: What about download.microsoft.com, support.microsoft.com

Post by DOS »

3155ffGd wrote:
DOS wrote:and the Microsoft Programmer's Library's file format isn't documented
You can probably extract that with HELPMAKE.EXE included with Microsoft C 5.1/6.0 (haven't tried it myself).
No, that doesn't work. Also, if I recall correctly I had a look at the file and it doesn't look anything like the description of the Advisor help file format (or Windows 3.x .HLP file format).
3155ffGd wrote: I'm also working on a Knowledge Base Archive.
Have you considered working with Jeff from pcjs.org since he's already made one public?

3155ffGd
Posts: 391
Joined: Wed May 02, 2012 12:57 am

Re: What about download.microsoft.com, support.microsoft.com

Post by 3155ffGd »

DOS wrote:No, that doesn't work.
Hrm. That sucks.
DOS wrote:Have you considered working with Jeff from pcjs.org since he's already made one public?
Is he actually planning to expand his library beyond what's already there? It was my impression that he was keeping just those few articles, but if you know more please tell me.

DOS
Posts: 205
Joined: Sun Mar 16, 2014 6:56 am

Re: What about download.microsoft.com, support.microsoft.com

Post by DOS »

There's some more at https://jeffpar.github.io/kbarchive/, and I've been in discussion with him about extracting information from old MSDN CDs.

3155ffGd
Posts: 391
Joined: Wed May 02, 2012 12:57 am

Re: What about download.microsoft.com, support.microsoft.com

Post by 3155ffGd »

That's interesting, didn't know about that.

I downloaded the MSDN January 2000 DVD and extracted the Knowledge Base articles. It was super easy since Microsoft used standard CHM files which you can easily extract with HTML Help Workshop, and the individual KB articles are even neatly sorted by KB number and topic. Certainly much more pleasant to deal with than the things that came before (.MXS/.HLP) and after (.HXS). The MSDN DVD contained on the order of 80,000 Knowledge Base articles. After copying everything over to my collection and using some script magic to identify duplicates, I now have exactly 107,055 Knowledge Base articles.
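
The "script magic" for finding duplicates is not described in the post. One plausible approach, sketched here purely as an assumption, is to hash each article after normalising line endings and group files that share a hash. The function names and sample file names below are hypothetical.

```python
import hashlib
from pathlib import Path


def content_digest(data):
    """Hash article bytes after normalising line endings and outer whitespace."""
    normalised = data.replace(b"\r\n", b"\n").strip()
    return hashlib.sha1(normalised).hexdigest()


def group_duplicates(articles):
    """Given (name, bytes) pairs, return lists of names with identical content."""
    groups = {}
    for name, data in articles:
        groups.setdefault(content_digest(data), []).append(name)
    return [names for names in groups.values() if len(names) > 1]


def scan_folder(root):
    """Yield (path, bytes) for every file below root (hypothetical layout)."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            yield str(path), path.read_bytes()


if __name__ == "__main__":
    # Invented sample data: two copies differing only in line endings.
    sample = [
        ("msdn/Q100.htm", b"Same text\r\n"),
        ("ftp/Q100.htm", b"Same text\n"),
        ("ftp/Q200.htm", b"Different text\n"),
    ]
    print(group_duplicates(sample))
```

This only catches copies that are identical apart from line endings. The same article exported from two different sources with different markup would need a looser comparison, for example by KB number.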

The hardest thing now is downloading every MSDN Library CD set and identifying Knowledge Base articles that are still missing, as apparently articles tended to disappear pretty randomly. I downloaded the first MSDN Library from 1992 and it contains a LAN Manager category that is completely missing in my collection so far. Even harder will be identifying and finding Knowledge Base articles Microsoft never put on MSDN, especially those involving their former "Microsoft Home" and "Microsoft Games" departments. I have no idea if those appeared anywhere; the MS FTP archive had some entries here and there, and there's also MNY.EXE on the FTP under the Softlib, but not much more.

3155ffGd
Posts: 391
Joined: Wed May 02, 2012 12:57 am

Re: What about download.microsoft.com, support.microsoft.com

Post by 3155ffGd »

So I just wanted to give a quick heads-up.

I've been working on this up to just before this Christmas, after that the project got a little stale unfortunately. Right now I have a total of 202735 unique knowledge base articles spanning until around the end of 2007. This is only a work in progress though, I just stopped in the middle of my work and if I finish I'll probably end up around ~205000 to 210000 knowledge base articles.

I just wanted to know, is there actually any interest in me publishing this little archive? The folder is a little big, right now it's 1.14 GB uncompressed and even when compressed in a solid RAR archive it only goes down to 150 MB, so it will be a bit difficult to distribute. I've also been thinking of maybe making an HTML Help file out of this, if it is technically possible and feasible; it would require a lot of work though, because right now I have a wild mix of .htm and .txt files coming from different sources and thus having completely different structures.

I'll finish the project anyway at some point (once I get some motivation and less stress with other things) but what's slightly depressing is that even with this amount of KB articles there are still many glaring gaps in the collection (Windows NT/2000 KB articles post-2003 are missing completely as well as anything gaming-related, especially Xbox/Zune). I don't know what to do about this, of course I could hunt for knowledge base articles in the Internet Archive but that's gonna involve a lot of work.

DOS
Posts: 205
Joined: Sun Mar 16, 2014 6:56 am

Re: What about download.microsoft.com, support.microsoft.com

Post by DOS »

3155ffGd wrote: Right now I have a total of 202735 unique knowledge base articles spanning until around the end of 2007.
Nice! Is that from MSDN Library, TechNet, both, and/or other sources?

3155ffGd wrote: I just wanted to know, is there actually any interest in me publishing this little archive? The folder is a little big, right now it's 1.14 GB uncompressed and even when compressed in a solid RAR archive it only goes down to 150 MB, so it will be a bit difficult to distribute.
I'm interested!

3155ffGd wrote: a wild mix of .htm and .txt files coming from different sources and thus having completely different structures.
I'm surprised that you don't have RTF from decoding Multimedia Viewer files?

3155ffGd wrote: I'll finish the project anyway at some point
You're doing better than me at least :)

3155ffGd
Posts: 391
Joined: Wed May 02, 2012 12:57 am

Re: What about download.microsoft.com, support.microsoft.com

Post by 3155ffGd »

By now I have worked through all the MSDN CDs from July 2007 to October 1994, and I have started work on one of the TechNet CDs. TechNet takes a lot longer to process because it contains so many categories that are not covered by the MSDN CDs, especially things like games, Microsoft Bob, Word for DOS, Microsoft Works etc. A few KB articles also come from other sources like MSPL 1.3 or the Windows NT 3.1 KB archive that's on shareware CDs.

Basically what I did was:

* Extract all Help2 HTML files with hxcomp.exe
* Extract all HTML Help (.chm) files with HTML Help Compiler
* Recompile ivtlist.exe because the source code has a bug causing the program to fail on the MSDN CDs, then use it to extract the IVT HTML files
* Recompile helpdeco.exe because that also has a bug in the source code causing it to fail on some CDs, then use it to extract the .MVB files

As for the RTF files - I learned the hard way that Microsoft Word 2007 has a hardcoded 512 MB limit for RTF files, and will refuse to open files larger than that. It also cannot support more than 32,767 pages without having to turn off page view mode. WordPad from Windows 7 x64 did work but was so horribly slow that it was unusable. So I had to resort to a different solution - open the RTF files in a text editor and use some clever search & replace to remove all RTF specific formatting to end up with a plain text file.
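
The exact search & replace steps are not given, so the following is only a rough Python sketch of the same idea: drop RTF control words, turn \par, \line and \tab into whitespace, decode \'hh escapes (assuming Windows-1252 text) and discard the braces. It is not a full RTF parser. Text inside header groups such as the font table will leak into the output, and \u Unicode escapes are not handled.

```python
import re

# One pattern that recognises every RTF construct handled by this sketch.
RTF_TOKEN = re.compile(r"""
    \\([a-zA-Z]+)(-?\d+)?[ ]?   # control word, optional number, optional space
  | \\'([0-9a-fA-F]{2})         # hex-escaped 8-bit character
  | \\([^a-zA-Z])               # control symbol, e.g. an escaped brace
  | [{}]                        # group delimiters
  | [\r\n]+                     # raw line breaks, which RTF readers ignore
""", re.VERBOSE)

# Control words that stand for whitespace in the plain-text output.
BREAKS = {"par": "\n", "line": "\n", "tab": "\t"}


def _replace(match):
    word, _arg, hex_byte, symbol = match.groups()
    if word is not None:
        return BREAKS.get(word, "")          # all other control words vanish
    if hex_byte is not None:
        # Assumption: the files use the Windows-1252 code page.
        return bytes.fromhex(hex_byte).decode("cp1252", errors="replace")
    if symbol is not None:
        return symbol if symbol in "\\{}" else ""
    return ""                                # braces and raw line breaks


def rtf_to_text(rtf):
    """Strip RTF markup from a string, keeping only the visible text."""
    return RTF_TOKEN.sub(_replace, rtf)


if __name__ == "__main__":
    print(rtf_to_text(r"{\rtf1\ansi Hello \b world\b0\par Second line}"))
```

Because this works on a plain string, it avoids Word's size limits entirely, though a file of several hundred megabytes is better processed in chunks than read in one go.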

I used a few helpful scripts - one that removes all .txt files which already have a corresponding .htm file, and one that automatically searches through .RTF files and reports all KB articles by number which are not yet present in the database.
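
Neither script is shown in the thread. A minimal sketch of what they might look like, assuming files are named after the article and that article numbers appear in the text as "Q" followed by 4 to 7 digits (both are assumptions, as are the function names):

```python
import re
from pathlib import Path

# Assumed article-number format: "Q" followed by 4 to 7 digits.
KB_ID = re.compile(r"\bQ(\d{4,7})\b")


def redundant_txt(names):
    """Return the .txt file names that also exist with an .htm extension."""
    htm_stems = {Path(n).stem.lower() for n in names
                 if Path(n).suffix.lower() == ".htm"}
    return sorted(n for n in names
                  if Path(n).suffix.lower() == ".txt"
                  and Path(n).stem.lower() in htm_stems)


def missing_articles(text, known_ids):
    """Return article numbers mentioned in text but absent from known_ids."""
    found = {int(number) for number in KB_ID.findall(text)}
    return sorted(found - set(known_ids))


if __name__ == "__main__":
    # Invented sample data for illustration.
    print(redundant_txt(["Q100.htm", "Q100.txt", "Q200.txt"]))
    print(missing_articles("See Q12345 and Q67890.", {12345}))
```

The first function only reports candidates instead of deleting them, so the list can be checked before anything is removed.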

The hard work is mostly with manually extracting the individual KB articles from the RTF files to individual .txt files. It probably could be scripted but I don't have the necessary scripting knowledge to do that.
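
If each article in the converted text starts with a recognisable header line, the splitting could be scripted along these lines. This is a sketch under that assumption: the "Article ID: Q..." header used in the example is hypothetical and would have to be replaced with whatever the real files contain.

```python
import re


def split_articles(text, header_pattern):
    """Split text into {article_id: article_text} at each header line.

    header_pattern is a regular expression that matches an article's
    first line and captures the article ID in group 1.
    """
    header = re.compile(header_pattern, re.MULTILINE)
    matches = list(header.finditer(text))
    articles = {}
    for index, match in enumerate(matches):
        # An article runs from its header to the next header or end of text.
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(text)
        articles[match.group(1)] = text[match.start():end].strip()
    return articles


if __name__ == "__main__":
    # Hypothetical input; the real header format may differ.
    sample = ("Article ID: Q100\nFirst body\n\n"
              "Article ID: Q200\nSecond body\n")
    for article_id, body in split_articles(
            sample, r"^Article ID: (Q\d+)$").items():
        print(article_id, len(body))
```

Each entry can then be written to its own .txt file named after the article ID. If the same ID appears twice, the later article overwrites the earlier one in this sketch.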

DOS
Posts: 205
Joined: Sun Mar 16, 2014 6:56 am

Re: What about download.microsoft.com, support.microsoft.com

Post by DOS »

3155ffGd wrote:* Recompile helpdeco.exe because that also has a bug in the source code causing it to fail on some CDs, then use it to extract the .MVB files
Is this the "Allocation of 0 bytes failed. File too big." issue (covered by https://sourceforge.net/p/helpdeco/bugs/1/)? If so, is there any chance you could share the patch?

3155ffGd
Posts: 391
Joined: Wed May 02, 2012 12:57 am

Re: What about download.microsoft.com, support.microsoft.com

Post by 3155ffGd »

Could be that one, I don't remember, it was too long ago. Basically, what I did is in this if:

if((groups||multi)&&(browsenums>1))

comment out the check for multi, like this:

if((groups/*||multi*/)&&(browsenums>1))

The symptom of this bug is that helpdeco chokes right after getting to the [GROUPS] section. As far as I noticed, this change has no side effects.

DOS
Posts: 205
Joined: Sun Mar 16, 2014 6:56 am

Re: What about download.microsoft.com, support.microsoft.com

Post by DOS »

Thanks! Unfortunately that doesn't fix the issue I'm hitting :(

Edit: I remembered that I forgot to recompile *hehe* It did help, thanks!
