BetaArchive
https://www.betaarchive.com/forum/

Deduplication on the FTP
https://www.betaarchive.com/forum/viewtopic.php?f=2&t=29865
Page 1 of 2

Author:  mrpijey [ Wed Oct 30, 2013 9:13 pm ]
Post subject:  Deduplication on the FTP

I covered delta patches in a previous thread but after some experimenting (got a whole pile of 4TB drives here with experimental files, packings etc on them) I've come to the conclusion that delta patches will not be used on BA.

Yes, delta patches work and they do save a significant amount of space, but there are other factors we have to consider when building the archives for BetaArchive, factors such as:

  1. Management time and resources. Using delta patches works, but it requires a significant amount of time to research and find the best combination. It also adds extra time for you members to unpack these archives, especially if you have to download a source file plus a set of delta patches.
  2. Compression size vs compression time. Compressing files takes time, sometimes a few hours for large games. And the savings might be small, if any. In many cases the archive is still larger due to the recovery overhead.
  3. Real archive size vs stored archive size. This may be the most important thing. All these files on BA take a lot of space, and the archive will grow over time. We've already reached over 5TB and it will not stop there. Server space costs money, bandwidth costs money, and archiving and backing it all up costs money and a lot of time.

So a balance between these three items has to be reached. We either get smaller files (faster downloads for members) but at the cost of storage space and management time (item 1). Or we save on management time but increase the archive size and lose server space (= increased bandwidth and server costs) (item 2). Or we do what we really should: utilise certain modern technologies to cut down on space usage and backup time, while unfortunately keeping the file size up. Does it sound paradoxical? Keep the size up but keep the disk usage down? Let me explain:

You all know by now about xdelta3 and delta compression (otherwise see the thread I linked at the beginning). By comparing two files, only the data that differs is saved into a new file, thus cutting down the size of the "patch" file. By later applying the patch file to the source you can recreate the other file, using the differential data plus the common data to rebuild it. As good as this may be, it has two major drawbacks: it only handles two files at a time (which I have to choose manually for the best result), and you have to keep track of the source files and patches; you can't rename either of them or it won't find the files to restore.
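
For the curious, here's a rough sketch of that two-file round trip driven from Python, assuming the xdelta3 command-line tool is installed and on the PATH (the file names are just placeholders):

Code:
import subprocess

# Encode: save only the data in the target that differs from the source.
def make_patch(source, target, patch):
    subprocess.run(["xdelta3", "-e", "-s", source, target, patch], check=True)

# Decode: rebuild the target from the source plus the patch.
def apply_patch(source, patch, rebuilt):
    subprocess.run(["xdelta3", "-d", "-s", source, patch, rebuilt], check=True)

# Placeholder file names, just to show the workflow.
make_patch("older_build.iso", "newer_build.iso", "newer_from_older.xd3")
apply_patch("older_build.iso", "newer_from_older.xd3", "newer_build_rebuilt.iso")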

So after some research, and a very convincing argument from dw5304 here on BA, I looked into something called deduplication, which is available in Windows Server 2012, 2012 R2 and Windows 8 (with some hacks; it doesn't support it by default). It basically does what delta compression does, but in the background, transparently and on the whole volume at once. Which means I don't need to select which files to process, nor keep track of what the source files were etc. It's launched with a command and runs in the background (or on an automatic schedule), and after some processing you get your result: same file size, smaller storage footprint on the hard drive.
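
To give an idea of what "launched with a command" means, this is roughly how it gets switched on and kicked off, to the best of my knowledge (the drive letter is just an example; in practice you'd type the cmdlets straight into PowerShell, the Python wrapper here is only for illustration):

Code:
import subprocess

def ps(command):
    # Hand a single cmdlet line to PowerShell.
    subprocess.run(["powershell.exe", "-NoProfile", "-Command", command], check=True)

# One-time: install the dedup feature and enable it on the archive volume.
ps("Install-WindowsFeature -Name FS-Data-Deduplication")
ps('Enable-DedupVolume -Volume "E:"')

# Kick off an optimization pass (or leave it to the built-in schedule),
# then check how much space has been reclaimed so far.
ps('Start-DedupJob -Volume "E:" -Type Optimization')
ps('Get-DedupStatus -Volume "E:"')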

But to achieve this, no compression can be used whatsoever. This means that music, video and data that is already compressed (InstallShield archives, games, video, audio etc.) will not deduplicate at all. But those won't compress much either, so the loss in the end is minimal. So I went ahead and did some tests on BA archives... I chose one of the BA drives that holds compilations and some abandonware stuff (you'll see a full list of tested categories in the screenshot below), unpacked it all and then enabled deduplication. Where I achieved 30-40% savings with xdelta compression (which was also painfully complicated because I had to manually select and sort the files to be delta compressed), the savings achieved by deduplication were far more impressive.

Here's a small example with some of the BetaArchive file sections:

[Image: screenshot of the deduplication test results per category]
The left side is the fully uncompressed archives; the right side is the BA archive as it looks today, max-compressed RAR files with 5% recovery.

Do you spot the difference? The image speaks all by itself really...

However, there is one feature we rely on greatly on BA for keeping the archives working and error-free, and that's the recovery data. When the files are stored uncompressed we lose this functionality, and it also greatly increases the file count (some archives contain hundreds if not thousands of files, files you would have to download one by one instead of grabbing the whole archive as a single file). This was a situation I was uncomfortable living with: I would have no means to verify the integrity of the files. Sure, I could use md5's, but that would add one more step to deal with, and I would still need a recovery archive set. An external par2 set would work, but that would further complicate things. For example, if I renamed a release (which happens quite often) I would also need to rename the par2 set, or at the very least rebuild it if some files changed. So I would again gain some extra steps to keep the files intact, something I want to avoid.
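
Just to show what that "one more step" would look like, here's a quick sketch of the kind of md5 manifest bookkeeping I'd rather avoid (purely an illustration, not anything we actually run):

Code:
import hashlib, os

def md5_of(path, block=1024 * 1024):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            h.update(chunk)
    return h.hexdigest()

# Write one checksum line per file in a release folder...
def write_manifest(folder, manifest="checksums.md5"):
    with open(os.path.join(folder, manifest), "w") as out:
        for root, _, files in os.walk(folder):
            for name in sorted(files):
                if name == manifest:
                    continue
                full = os.path.join(root, name)
                rel = os.path.relpath(full, folder)
                out.write(md5_of(full) + " *" + rel + "\n")

# ...and re-hash everything later to find files that no longer match.
def verify_manifest(folder, manifest="checksums.md5"):
    bad = []
    with open(os.path.join(folder, manifest)) as f:
        for line in f:
            digest, rel = line.rstrip("\n").split(" *", 1)
            if md5_of(os.path.join(folder, rel)) != digest:
                bad.append(rel)
    return bad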

So again I went to the "BetaArchive lab" I've got and tested various solutions. Of all the archive formats, RAR seems to be the most versatile one, and we already have batch routines and scripts that work with it (since all BA archives are already compressed with RAR). So why not use RAR archives with no compression at all, but still keep the 5% recovery record? Some archive formats spread the parity information across the entire archive, something that would destroy any chance of successful deduplication (mind you, the contents of the files within the archives have to remain uncompressed and unaltered; injecting parity information into the data stream would change that), but from reviewing the RAR format it appears the recovery information is added at the end of the archive. Great! So that problem would be solved too: deduplication would still work and only add a 5% overhead (5% recovery, which is what we use for RAR files on BA). Not too shabby.
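
For reference, the repack boils down to something like this, assuming the rar command-line tool (-m0 = store, -rr5% = 5% recovery record; the folder and archive names are placeholders):

Code:
import subprocess

def repack_store(release_dir, archive_path):
    subprocess.run(
        ["rar", "a",       # add files to a new archive
         "-m0",            # store only, keep the data stream byte-identical
         "-rr5%",          # append a 5% recovery record at the end
         "-r",             # recurse into subfolders
         archive_path, release_dir],
        check=True,
    )

# Placeholder names, just to show the call.
repack_store("some_release_folder", "some_release.rar")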

So currently I am repacking the same files listed in the screenshot above with no compression and with 5% recovery data. I will post the results once I am done, and also a short summary of my findings, results and what will happen with the BA FTP if I choose to go ahead with deduplication.

For now review that screenshot above and tell me what you people think... :).

*pause for effect*

Author:  jagotu [ Wed Oct 30, 2013 9:34 pm ]
Post subject:  Re: Deduplication on the FTP

If I get this right, we will be downloading the full RARs and the server will be stitching them together from various locations on the fly?

If that's the case, can BA's server handle the load?

Author:  mrpijey [ Wed Oct 30, 2013 9:39 pm ]
Post subject:  Re: Deduplication on the FTP

From the end user's perspective (i.e. yours :) ) there won't be anything different from now. You download the RAR as it is. And no, BA doesn't "stitch it together" from various locations; it's stored just as it is now. The RAR files are not in any way split up across volumes etc., so there will not be any more strain on the BA server. In reality there should actually be less strain on BA.

Author:  jagotu [ Wed Oct 30, 2013 9:43 pm ]
Post subject:  Re: Deduplication on the FTP

It will have to save the parts that are the same only once. Which means when I download something, it will have to look in some kind of registry to see which parts are where, and if it removes not only duplicates but also similar things (like xdelta does), it will have to patch the files on the fly, therefore putting some load on the server, won't it?

Author:  mrpijey [ Wed Oct 30, 2013 10:11 pm ]
Post subject:  Re: Deduplication on the FTP

Well, all of that is done in the background, and it's all just disk I/O. And for deduplicated data the I/O actually decreases, since it has to read less data from the drive when several people download at the same time (which there always are, as we always have members logged on downloading files). This is the main advantage of deduplication, especially on high-load servers. Less disk activity = faster data access. But to answer your question: yes, it has to "patch the files" on the fly, but it doesn't work like xdelta. It doesn't have to rebuild the files before you can access them; it simply accesses the relevant data blocks on the fly, just like any other data.

Author:  Darkstar [ Wed Oct 30, 2013 11:26 pm ]
Post subject:  Re: Deduplication on the FTP

Deduplication is great. We sell enterprise storage systems for VMware server virtualization, and we regularly see up to 70% or 80% space savings for the virtual machines. That means you can have 3TB worth of virtual machines in a 500GB volume and still have plenty of space for keeping weeks' worth of snapshot backups.

Enterprise storage even has its own redundancy against corruption to boot. Too bad these babies draw so much power that it's almost impossible (or at least quite expensive) to run them 24/7 at home :(

Author:  mrpijey [ Thu Oct 31, 2013 12:04 am ]
Post subject:  Re: Deduplication on the FTP

Those enterprise storage devices are usually no different from a regular PC except for a custom-built case and some extra hardware. At the core it's just another system with a CPU, RAM, a hard drive and a server OS... You can technically run a "home enterprise server" on a laptop if you want. So it doesn't need to be expensive :). The blade servers I've got are just Xeon-based computers and I can run Server 2012 Datacenter on them, as well as Windows 3.0 if I want...

But deduplication is something we'll use on BA. I've been experimenting with it for a couple of weeks now and I am going to repack the entire archive and set up a dedicated BA server here at home that will handle all the files properly. Once I finish my main experiments I will reply with a summary and some additional info for members if needed. But overall there won't be any major changes for members, except that the archive may grow a little bit due to the lack of compression. Everything else will stay the same, such as folder structure, filenames and so on.

Author:  pizzaboy192 [ Thu Oct 31, 2013 4:00 am ]
Post subject:  Re: Deduplication on the FTP

So if I get this correctly, DeDuplication is basically pointing multiple file "handles" to the same location on the disc, thereby making it take up less space because there is only one actual copy of that block of data on the disc, but multiple references to it?

That's rather awesome (reminds me of how I'm understanding Python's variables to work, but that's just a random tie-in)

I wonder what would happen if I were to run that same sort of tool on my desktop's HDDs. I know that there's way too many duplicates of files on it already.

Author:  WinPC [ Thu Oct 31, 2013 5:36 am ]
Post subject:  Re: Deduplication on the FTP

The only problem with this is that the downloaded files would be larger, which has me concerned that it would take longer for users to download them once this system is put into place.

Still, it does appear to save storage space on the server side, so whatever is done about this, I hope that it works out well for the site.

Author:  mrpijey [ Thu Oct 31, 2013 8:42 am ]
Post subject:  Re: Deduplication on the FTP

pizzaboy192 wrote:
So if I get this correctly, DeDuplication is basically pointing multiple file "handles" to the same location on the disc, thereby making it take up less space because there is only one actual copy of that block of data on the disc, but multiple references to it?

That's rather awesome (reminds me of how I'm understanding Python's variables to work, but that's just a random tie-in)

I wonder what would happen if I were to run that same sort of tool on my desktop's HDDs. I know that there's way too many duplicates of files on it already.
Yeah, except that it's not managed on a file level, but on a block level. So even if two files are of completely different sizes it can still deduplicate parts of them. If, for example, both of them contain the same setup.exe but the rest differs, then the setup.exe blocks will still be deduplicated, without taking apart the ISO.
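
A toy way to picture it (fixed-size chunks here just for simplicity; the real feature uses variable-size chunks so it can match data even when it isn't aligned the same way in both files): two ISOs that happen to contain the same setup.exe bytes end up sharing those chunks in the store.

Code:
import hashlib

CHUNK = 64 * 1024  # 64 KB, purely for illustration

def dedup_store(paths):
    store = {}    # chunk hash -> chunk bytes, each unique chunk kept once
    recipes = {}  # file path  -> ordered list of chunk hashes
    for path in paths:
        hashes = []
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(CHUNK), b""):
                digest = hashlib.sha256(chunk).hexdigest()
                store.setdefault(digest, chunk)  # duplicate chunks are not stored again
                hashes.append(digest)
        recipes[path] = hashes
    return store, recipes

def rebuild(path, store, recipes):
    # Reading a file back is just following its recipe of chunk references.
    return b"".join(store[h] for h in recipes[path])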

And deduplicating your desktop HDD would work as long as it's not the boot volume and you're running a supported OS. But remember, it doesn't work on files that are already heavily compressed, such as video and audio.

WinPC wrote:
The only problem with this is that the downloaded files would be larger, which has me concerned that it would take longer for users to download them once this system is put into place.

Still, it does appear to save storage space on the server side, so whatever is done about this, I hope that it works out well for the site.
Unfortunately the files do get somewhat larger (if they were heavily compressed before then they will get noticeably larger, otherwise the size increase will be very marginal), so the files users download will take longer to grab. But this is unfortunately the only way to do it. The other solution would be to not archive the files at all, as drive space is expensive and difficult to manage. So yes, files will be larger, but we'll also be able to store more of them :).

I am currently repacking the categories I have in the screenshot and will report back once it completes. But the screenshot I provided proves more than enough that it works... I mean, look at it. We've got over 2.3TB of files on a drive that holds only 1TB. That sure is a lot better than having 1.7TB of files that don't fit on anything less than 1.7TB. I have an even better example which I will provide very soon, and you'll be quite amazed by the result...

Author:  hounsell [ Thu Oct 31, 2013 12:08 pm ]
Post subject:  Re: Deduplication on the FTP

As I advocated a long time ago, 2012's Deduplication is nothing short of miraculous.

Here's the last screenshot from when I had an NTFS Deduplicated volume:
[Image: screenshot of the deduplicated volume's space savings]

Considering this volume was actually mostly (~60%) HD video, with about 25% uncompressed ISOs and about 10% software in various formats, I was seriously impressed with how it managed to get back 34% through dedup. Performance was generally much improved, actually. Deduplicating initially took a lot of resources - about 8-12GB of RAM (on a machine with 16GB) over a period of about 4 days - but read performance was better afterwards and writes improved too. One factor no one has mentioned here is that you're also more likely to hit the cache with deduplicated files, particularly on reads, which makes a huge difference.

Now I've moved to a mirrored storage space though, so I've changed file system from NTFS to ReFS, which currently does not support Dedup.

Author:  WinPC [ Thu Oct 31, 2013 9:10 pm ]
Post subject:  Re: Deduplication on the FTP

mrpijey wrote:
Unfortunately the files do get somewhat larger (if they were heavily compressed before then they will get noticeably larger, otherwise the size increase will be very marginal), so the files users download will take longer to grab. But this is unfortunately the only way to do it. The other solution would be to not archive the files at all, as drive space is expensive and difficult to manage. So yes, files will be larger, but we'll also be able to store more of them :).

I am currently repacking the categories I have in the screenshot and will report back once it completes. But the screenshot I provided proves more than enough that it works... I mean, look at it. We've got over 2.3TB of files on a drive that holds only 1TB. That sure is a lot better than having 1.7TB of files that don't fit on anything less than 1.7TB. I have an even better example which I will provide very soon, and you'll be quite amazed by the result...
Would it be possible to run the server on a faster connection, so that downloads don't take as long for the users here? I understand that most likely isn't possible (especially with the money that is already required to maintain the FTP server itself), but I was just asking anyway, just to be sure.

Also, do you have a list of which files will take significantly longer to download?

Author:  mrpijey [ Thu Oct 31, 2013 9:23 pm ]
Post subject:  Re: Deduplication on the FTP

The BA server is already fast, or are you really getting 100Mbit transfer rates already? Remember, it doesn't only require the server to be fast, but also the client... It doesn't matter if the BA server can push 1000Gbit if you're on a slow 2Mbit connection...

The bandwidth on BA has never been an issue and we've rarely maxed it out (only during some very special beta releases), so bandwidth isn't a concern.

As for the larger files: no, it all depends on how well they compress. The ones that compress well will get larger (as they will now be stored at their full uncompressed size), while the ones that compress badly will remain roughly the same size, as compressed vs uncompressed doesn't differ much for them. But expect all files to become somewhat larger, as there's hardly any file that gains 0% from compression at the maximum compression setting. It's unfortunate, but it's either that or we stop hosting any more files, as we will soon run out of space if we continue as we do now.

Author:  pizzaboy192 [ Thu Oct 31, 2013 10:15 pm ]
Post subject:  Re: Deduplication on the FTP

Another thing to consider is that if you're US-based, you've got to deal with the rather slow international interconnects. The BA server is still hosted in France (IIRC), so those of us who live "across the pond" will see reduced speed and increased latency compared to those who aren't located as far away.

Author:  Andy [ Fri Nov 01, 2013 8:39 am ]
Post subject:  Re: Deduplication on the FTP

Unfortunately this is a law of physics we can't change. OVH is improving their international connectivity all the time so you should see improvements over time. It's certainly better than it was a year ago.

Author:  orsg [ Fri Nov 01, 2013 11:19 am ]
Post subject:  Re: Deduplication on the FTP

I did some diving into the file systems space a while ago while building my home NAS. I finally settled on ZFS, mainly because it always checksums everything and can fix corruption automatically (if you have redundancy; otherwise it at least detects silent data corruption on disk) without manually running md5 checks or uncompressing archives. It also offers deduplication, but that blows up ZFS' already rather hungry RAM usage quite heavily. The rule of thumb for ZFS is 1GB of RAM for every 1TB of storage without dedup, although in my case it's actually more like 0.6GB/TB. Enabling dedup is said to require about 5GB of RAM per TB of storage, because it has to keep tables in memory to quickly find the duplicates.
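
A quick back-of-the-envelope using just those rules of thumb (nothing authoritative, only the numbers above):

Code:
def zfs_ram_estimate_gb(storage_tb, dedup=False):
    # ~1 GB RAM per TB of storage without dedup, ~5 GB per TB with it.
    return storage_tb * (5.0 if dedup else 1.0)

print(zfs_ram_estimate_gb(8))               # ~8 GB for an 8TB pool
print(zfs_ram_estimate_gb(8, dedup=True))   # ~40 GB once dedup is enabled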

How about the memory usage with dedup under Windows?

Author:  Holmes [ Thu Nov 07, 2013 6:01 am ]
Post subject:  Re: Deduplication on the FTP

It's good that you're saving so much space but why aren't you using the vastly superior Linux to run the site?

Author:  x010 [ Thu Nov 07, 2013 6:41 am ]
Post subject:  Re: Deduplication on the FTP

Holmes wrote:
It's good that you're saving so much space but why aren't you using the vastly superior Linux to run the site?

They use IIS as the website base, which is Windows-only.
There's also the fact that the FTP server software (Gene6, if I'm right) does not run on Linux.

Author:  mrpijey [ Thu Nov 07, 2013 8:22 am ]
Post subject:  Re: Deduplication on the FTP

Holmes wrote:
It's good that you're saving so much space but why aren't you using the vastly superior Linux to run the site?

Vastly superior? That may be your opinion of course, but not everyone else's. The entire server platform is based around Windows because it's easier for us to set up, administer and diagnose, and the software and functions we need to make everything work would be a lot more complicated to set up on the Linux platform. Windows Server does the hosting job very well and we have no reason to change it and make things more complicated than they already are.

Author:  orsg [ Fri Nov 08, 2013 12:15 am ]
Post subject:  Re: Deduplication on the FTP

mrpijey wrote:
Vastly superior? That may be your opinion of course, but not everyone else's.

At least in the file systems space, other platforms offer better solutions than good old NTFS. Administration and software stack are highly dependent on personal (dis)like, of course.

Author:  ppc_digger [ Fri Nov 08, 2013 4:18 pm ]
Post subject:  Re: Deduplication on the FTP

If possible, you should try to make the files inside the RAR archives align with the dedup sector size to maximize dedup gains.

Also, deduplication without sufficient RAM makes I/O operations slow to a crawl during writes (that's true for every deduplication system), so make sure you have enough, and keep the original compressed RARs in case you change your mind.

Author:  Andy [ Fri Nov 08, 2013 4:20 pm ]
Post subject:  Re: Deduplication on the FTP

The server we're going to be using has at least 8GB of RAM and is completely dedicated to the task. I'm more concerned about CPU usage personally as I haven't seen any real use data from mrpijey but we'll figure something out if it's an issue at all.

Author:  orsg [ Sat Nov 09, 2013 9:01 pm ]
Post subject:  Re: Deduplication on the FTP

Andy wrote:
The server we're going to be using has at least 8GB of RAM and is completely dedicated to the task. I'm more concerned about CPU usage personally as I haven't seen any real use data from mrpijey but we'll figure something out if it's an issue at all.

Usually dedup works on a gigantic hash table, so as long as the whole table fits into RAM, those lookups should be near-instantaneous. In the case of ZFS, 8GB of RAM would suffice for deduplicating at most 2TB worth of actually stored data, probably less. An alternative is swapping the dedup table out to an SSD.
But even if the impact were noticeable, it's only relevant for writing. I guess the FTP here is mostly reads, which don't differ whether deduplication is in place or not.

Author:  mrpijey [ Sun Nov 10, 2013 2:36 am ]
Post subject:  Re: Deduplication on the FTP

ZFS is vastly inferior when it comes to deduplication due to its insane memory usage. If we had used ZFS our server would have needed some 60-80GB of RAM, which is completely ridiculous, though I suspect ZFS could get by with a lot less RAM than that. The Windows deduplication service certainly does: so far memory usage hasn't exceeded 3GB while deduplicating over 8TB of data (which includes other stuff besides BA). And I've compared ZFS deduplication to NTFS deduplication and (so far) the results have been the same, but with vastly higher memory requirements for ZFS. There's no real magic to deduplication, only how it's managed, processed and stored by the file system in use. ZFS has other advantages of course, as it is a superior file system to NTFS, but nothing BetaArchive has any need for.

Also, "dedup sector size"? Dedup isn't a file system, there is no "sector size", nor can I affect the contents of the RAR files except add or remove files (which is of course not wanted). To maximize deduplication efficiency I simply have to make sure the data stream is unaltered when all the files are put together into a rar file, something you achieve with using zero compression (i.e "store"). And the BA server is by 99% dependent on reading operations, writing is only done during additions or repacks, so it's not a big issue. And deduplication operations can be moved to any time of the day so we can schedule it during the times with the lowest activity on BA. Or we can simply turn off the FTP during deduplication but that is not needed as the server can handle the load just fine.

As for Linux advantages over Windows, those advantages become disadvantages if you spend more time configuring and managing the same services and features on one OS than you would on the other, given that the end results would be the same. As we don't have any performance bottlenecks at BA, there are no gains to be had by using a different operating system. As the saying goes, time is money, and if you spend more to get the same (or less) then it's not worth it. BetaArchive will never use Linux as a host system because it is simply more trouble to get it working properly than it's worth, and the gains from such a change would be very small, if any. No point arguing about it however.

Author:  ppc_digger [ Wed Nov 13, 2013 11:11 pm ]
Post subject:  Re: Deduplication on the FTP

mrpijey wrote:
Also, "dedup sector size"? Dedup isn't a file system, there is no "sector size"

I meant chunk size, my mistake.

Deduplicating file systems split your data into chunks and hash them. Whenever a new chunk needs to be written it is compared to the other chunks in the deduplication table. If you align the files inside the uncompressed archives to match the chunk size (32-128 KB in this case) you increase the chance for dedup hits and thus increase your dedup savings.
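
To illustrate the idea (the chunk size is just a guess within that range, and this only shows the principle of padding to a boundary, not an actual RAR feature):

Code:
CHUNK = 64 * 1024  # 64 KB, somewhere in the 32-128 KB range mentioned above

def padding_needed(file_size, chunk=CHUNK):
    # Bytes of padding needed so the next stored file starts on a chunk boundary.
    remainder = file_size % chunk
    return 0 if remainder == 0 else chunk - remainder

# e.g. a 100 KB file would need 28 KB of padding to end on a 128 KB boundary
print(padding_needed(100 * 1024))  # 28672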

Also, NTFS deduplication is less memory-intensive than ZFS because it dedups asynchronously, while ZFS dedups during writes (so the entire dedup table has to fit in RAM at all times).

EDIT: I looked a bit into controlling file alignment inside RAR archives, and there seems to be no easy way to do it.
