Post subject: Deduplication on the FTP | Posted: Wed Oct 30, 2013 9:13 pm
Administrator | Joined: Tue Feb 12, 2008 5:28 pm | Posts: 7935
I covered delta patches in a previous thread, but after some experimenting (I've got a whole pile of 4TB drives here with experimental files, packings etc. on them) I've come to the conclusion that delta patches will not be used on BA.

Yes, delta patches work and they do save a significant amount of space, but there are other factors we have to consider when building the archives for BetaArchive, factors such as:

  1. Management time and resources. Using delta patches works, but it requires a significant amount of time to research and find the best combination. It also adds extra time for members to unpack these archives, especially if you have to download a source file and a set of delta patches.
  2. Compression size vs compression time. Compressing files takes time, sometimes a few hours for large games, and the savings might be small, if any. In many cases the archive is still larger due to the recovery overhead.
  3. Real archive size vs stored archive size. This may be the most important one. All these files on BA take a lot of space, and the archive will grow over time. We've already passed 5TB and it will not stop there. Server space costs money, bandwidth costs money, and archiving and backing it all up costs money and a lot of time.

So a balance between these three items has to be reached. We either get smaller files (faster downloads for members) but at the cost of storage space and management time (item 1). Or we save on management time but increase the archive size and lose server space (= increased bandwidth and server costs) (item 2). Or we do what we really should: utilise certain modern technologies to cut down on space usage and backup time, but unfortunately keep the file size up. Does that sound paradoxical? Keep the size up but keep disk usage down? Let me explain:

You all know by now about xdelta3 and delta compression (otherwise see the thread I linked at the beginning). By comparing two files, only the data that differs is saved into a new file, cutting down the size of the "patch" file. By later applying the patch to the source you can recreate the other file from the differential data plus the data the two have in common. As good as this may be, it has two major drawbacks: it only handles two files at a time (which I have to choose manually for the best result), and you have to keep track of the source files and patches; you can't rename either of them or it won't find the files to restore.
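
To illustrate the principle (this is not the actual VCDIFF encoding xdelta3 uses, just a rough Python sketch of the idea): only the blocks of the target that differ from the source get stored, and the target is later rebuilt from the source plus those stored blocks. The block size and sample data below are made up for the example.

Code:
# Toy delta patching: store only the blocks of `target` that differ from `source`,
# then rebuild `target` from `source` plus the stored blocks.
# Illustration only; xdelta3 uses the far smarter VCDIFF encoding.
BLOCK = 4096  # arbitrary block size for the example

def make_delta(source: bytes, target: bytes) -> dict:
    delta = {"length": len(target), "blocks": {}}
    for offset in range(0, len(target), BLOCK):
        t_block = target[offset:offset + BLOCK]
        if t_block != source[offset:offset + BLOCK]:   # only differing data is kept
            delta["blocks"][offset] = t_block
    return delta

def apply_delta(source: bytes, delta: dict) -> bytes:
    out = bytearray(source[:delta["length"]].ljust(delta["length"], b"\x00"))
    for offset, block in delta["blocks"].items():
        out[offset:offset + len(block)] = block
    return bytes(out)

if __name__ == "__main__":
    src = b"A" * 10000
    dst = b"A" * 5000 + b"B" * 100 + b"A" * 4900
    patch = make_delta(src, dst)
    assert apply_delta(src, patch) == dst
    print("stored", sum(len(b) for b in patch["blocks"].values()), "of", len(dst), "bytes")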

So after some research, and a very convincing argument from dw5304 here on BA, I looked into something called deduplication, which is available in Windows Server 2012, 2012 R2 and Windows 8 (with some hacks; it doesn't support it by default). It basically does what delta compression does, but in the background, transparently, and on the whole volume at once. That means I don't need to select which files to process, nor keep track of what the source files were. It's launched with a command (or scheduled automatically) and runs in the background, and after some processing you get your result: same file size, smaller storage footprint on the hard drive.
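
As a rough picture of what the dedup engine is doing under the hood, here's a small Python sketch that walks a directory, splits every file into fixed-size chunks, hashes them, and reports how much space a single shared copy of each unique chunk would need. The real Windows dedup engine uses variable-size chunking (roughly 32-128KB) and its own chunk store, so treat this as a back-of-the-envelope estimator only; the path and chunk size are placeholders.

Code:
# Rough duplicate-chunk estimator: hash fixed-size chunks across a directory tree
# and compare total bytes vs bytes of unique chunks. Windows Server dedup uses
# variable-size chunks and a proper chunk store, so this is only an approximation.
import hashlib
import os

CHUNK = 64 * 1024  # placeholder; Windows dedup chunks average roughly 64KB

def estimate(root: str) -> None:
    total = 0
    unique = {}   # chunk hash -> chunk length
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as fh:
                while chunk := fh.read(CHUNK):
                    total += len(chunk)
                    unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    stored = sum(unique.values())
    print(f"logical size : {total / 2**30:.1f} GiB")
    print(f"unique chunks: {stored / 2**30:.1f} GiB")
    if total:
        print(f"estimated saving: {100 * (1 - stored / total):.1f}%")

if __name__ == "__main__":
    estimate(r"D:\BetaArchive\Abandonware")  # placeholder path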

But to achieve this, no compression can be used whatsoever. This means that music, video and data that is already compressed (InstallShield archives, games, video, audio etc.) will not deduplicate at all. But those won't compress much either, so the loss in the end is minimal. So I went ahead and did some tests on BA archives... I chose one of the BA drives that holds compilations and some abandonware stuff (you'll see a full list of tested categories in the screenshot below), unpacked it all and then enabled deduplication. Where I achieved 30-40% savings with xdelta compression (which was also painfully complicated, because I had to manually select and sort the files to be delta compressed), the savings achieved by deduplication were far more impressive.

Here's a small example with some of the BetaArchive file sections:

[Screenshot: per-category size comparison]
Left side is the fully uncompressed archives, right side is the BA archive as it looks today: max-compressed RAR files with 5% recovery.

Do you spot the difference? The image really speaks for itself...

However, there is one feature we rely on heavily on BA to keep the archives working and error-free, and that's the recovery data. When the files are stored unpacked we lose this functionality, and it also greatly increases the file count (some archives contain hundreds if not thousands of files, which you would have to download one by one instead of as a single archive file). That was a situation I was not comfortable living with, as I would have no means to verify the integrity of the files. Sure, I could use MD5s, but that would add one more step to deal with, and I would still need a recovery set. An external par2 set would work, but that would complicate things further. For example, if I renamed a release (which happens quite often) I would also need to rename the par2 set, or at the very least rebuild it if some files changed. So I would again be adding extra steps to keep the files intact, something I want to avoid.
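
For reference, the MD5 route I'm rejecting would look roughly like the sketch below (the manifest name and layout are just an example). It can detect corruption when verifying, but unlike a recovery record or a par2 set it cannot repair anything, and the manifest itself becomes one more file to maintain per release.

Code:
# Sketch of the rejected MD5-manifest approach: one checksum file per release folder.
# Verification can detect corruption, but nothing here can repair it.
import hashlib
import os

def md5_of(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1024 * 1024), b""):
            h.update(block)
    return h.hexdigest()

def write_manifest(release_dir: str, manifest: str = "checksums.md5") -> None:
    with open(os.path.join(release_dir, manifest), "w") as out:
        for dirpath, _dirs, files in os.walk(release_dir):
            for name in sorted(files):
                if name == manifest:
                    continue
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, release_dir)
                out.write(f"{md5_of(path)} *{rel}\n")

def verify_manifest(release_dir: str, manifest: str = "checksums.md5") -> bool:
    ok = True
    with open(os.path.join(release_dir, manifest)) as fh:
        for line in fh:
            digest, rel = line.rstrip("\n").split(" *", 1)
            if md5_of(os.path.join(release_dir, rel)) != digest:
                print("CORRUPT:", rel)
                ok = False
    return ok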

So again I went to the "BetaArchive lab" I've got and tested various solutions. Of all the archive formats, RAR seems to be the most versatile one, and we already have batch routines and scripts that work with it (since all BA archives are already compressed with RAR). So why not use RAR archives with no compression at all, but still with a 5% recovery record? Some archive formats spread the parity information across the entire archive, which would destroy any chance of successful deduplication (mind you, the contents of the files within the archives have to remain uncompressed and unaltered; injecting parity information into the data stream would change that), but reviewing the RAR format shows that the recovery information is added at the end of the archive. Great! So that problem is solved too: deduplication would still work, and the only overhead is the 5% recovery record we use for RAR files on BA anyway. Not too shabby.
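
If you want to try the repacking step yourself, something along these lines should do it, assuming the rar command-line tool is on your PATH. -m0 means "store" (no compression) and -rr5% requests a 5% recovery record; the paths are placeholders, so double-check the switches against your own rar version before running it on anything important.

Code:
# Repack a release as an uncompressed RAR with a 5% recovery record, so the payload
# bytes stay unaltered (dedup-friendly) while integrity protection is kept.
# Assumes the `rar` command-line tool is installed; verify the switches locally.
import subprocess

def repack_store_with_recovery(archive: str, source_dir: str) -> None:
    subprocess.run(
        [
            "rar", "a",   # add files to archive
            "-m0",        # store only, no compression
            "-rr5%",      # add a 5% recovery record (appended at the end of the archive)
            "-r",         # recurse into subdirectories
            archive,
            source_dir,
        ],
        check=True,
    )

if __name__ == "__main__":
    # placeholder paths for the example
    repack_store_with_recovery(r"D:\repacked\Some_Release.rar", r"D:\unpacked\Some_Release")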

So currently I am repacking the same files listed in the screenshot above with no compression and with 5% recovery data. I will post the results once I am done, and also a short summary of my findings, results and what will happen with the BA FTP if I choose to go ahead with deduplication.

For now review that screenshot above and tell me what you people think... :).

*pause for effect*

Post subject: Re: Deduplication on the FTP | Posted: Wed Oct 30, 2013 9:34 pm
FTP Access | Joined: Mon Feb 04, 2013 5:03 pm | Posts: 505 | Location: Czechia | Favourite OS: Development Release #5
If I get this right, we will be downloading the full RARs and the server will be stitching them together from various locations on the fly?

If that's the case, can BA's server handle the load?

Post subject: Re: Deduplication on the FTP | Posted: Wed Oct 30, 2013 9:39 pm
Administrator | Joined: Tue Feb 12, 2008 5:28 pm | Posts: 7935
From the end user's perspective (i.e. yours :) ) there won't be anything different from now: you download the RAR as it is. And no, BA doesn't "stitch it together" from various locations; it's stored just as it is now. The RAR files are not in any way split up across volumes etc., so there will not be any more strain on the BA server. In reality there should actually be less strain on BA.

Post subject: Re: Deduplication on the FTP | Posted: Wed Oct 30, 2013 9:43 pm
FTP Access | Joined: Mon Feb 04, 2013 5:03 pm | Posts: 505 | Location: Czechia | Favourite OS: Development Release #5
It will have to save the parts that are the same only once. Which means that when I'm downloading something, it will have to look in some kind of registry to see which parts are where, and if it removes not only exact duplicates but also similar things (like xdelta does), it will have to patch the files on the go, therefore putting some load on the server, won't it?

Post subject: Re: Deduplication on the FTP | Posted: Wed Oct 30, 2013 10:11 pm
Administrator | Joined: Tue Feb 12, 2008 5:28 pm | Posts: 7935
Well, all of that is done in the background, and it's all just disk I/O. For deduplicated data the I/O actually decreases, since the server has to read less data from the drive when several people download at the same time (and there always are, as we have members logged on and downloading at all hours). This is the main advantage of deduplication, especially on heavily loaded servers: less disk activity = faster data access. But to answer your question, yes, it has to "patch the files" on the go, but it doesn't work like xdelta; it doesn't have to rebuild the files before you can access them, it simply reads the relevant data blocks on the fly, just like any other data.
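
Conceptually the read path looks something like the rough Python sketch below: a deduplicated file is essentially a list of chunk references that get resolved against a shared chunk store as the data is read, with no separate rebuild step. This is only a mental model, not the actual on-disk layout Windows uses.

Code:
# Mental model of reading a deduplicated file: the file is a list of chunk
# references, and reads resolve each reference against a shared chunk store.
from hashlib import sha256

chunk_store = {}   # chunk hash -> chunk bytes, shared by all files

def store_file(data: bytes, chunk_size: int = 64 * 1024) -> list:
    """Split a file into chunks, keep one copy of each unique chunk, return the recipe."""
    recipe = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        key = sha256(chunk).digest()
        chunk_store.setdefault(key, chunk)   # duplicate chunks are stored only once
        recipe.append(key)
    return recipe

def read_file(recipe: list) -> bytes:
    """Serve a read by following the chunk references on the fly."""
    return b"".join(chunk_store[key] for key in recipe)

if __name__ == "__main__":
    iso_a = store_file(b"SETUP.EXE" * 20000 + b"game A data" * 1000)
    iso_b = store_file(b"SETUP.EXE" * 20000 + b"game B data" * 1000)
    assert read_file(iso_a) != read_file(iso_b)        # two different files...
    print("unique chunks stored:", len(chunk_store))   # ...but their shared chunks are stored once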

Post subject: Re: Deduplication on the FTP | Posted: Wed Oct 30, 2013 11:26 pm
Donator | Joined: Fri May 14, 2010 1:29 pm | Posts: 865 | Location: Southern Germany | Favourite OS: IRIX 5.3
Deduplication is great. We sell enterprise storage systems for VMware server virtualization, and we regularly see up to 70% or 80% space savings for the virtual machines. That means you can have 3TB worth of virtual machines in a 500GB volume and still have plenty of space for weeks' worth of snapshot backups.

Enterprise storage even has its own redundancy against corruption to boot. Too bad these babies draw so much power that it's almost impossible (or at least quite expensive) to run them 24/7 at home :(

Post subject: Re: Deduplication on the FTP | Posted: Thu Oct 31, 2013 12:04 am
Administrator | Joined: Tue Feb 12, 2008 5:28 pm | Posts: 7935
Those enterprise storage devices are usually no different from a regular PC except for a custom-built case and some extra hardware. At the core it's just another system with a CPU, RAM, hard drives and a server OS... You can technically run a "home enterprise server" on a laptop if you want, so it doesn't need to be expensive :). The blade servers I've got are just Xeon-based computers and I can run Server 2012 Datacenter on them, as well as Windows 3.0 if I want...

But deduplication will be something we'll use on BA. I've been experimenting with it for a couple of weeks now, and I am going to repack the entire archive and set up a dedicated BA server here at home that will handle all the files properly. Once I finish my main experiments I will reply with a summary and some additional info for members if needed. Overall there won't be any major changes for members, except that the archive may grow a little due to the lack of compression. Everything else will stay the same, such as folder structure, filenames and so on.

Post subject: Re: Deduplication on the FTP | Posted: Thu Oct 31, 2013 4:00 am
Staff | Joined: Thu Oct 23, 2008 3:25 am | Posts: 2688 | Location: Earth. | Favourite OS: Real Life
So if I get this correctly, DeDuplication is basically pointing multiple file "handles" to the same location on the disc, thereby making it take up less space because there is only one actual copy of that block of data on the disc, but multiple references to it?

That's rather awesome (reminds me of how I'm understanding Python's variables to work, but that's just a random tie-in)

I wonder what would happen if I were to run that same sort of tool on my desktop's HDDs. I know that there's way too many duplicates of files on it already.

Post subject: Re: Deduplication on the FTP | Posted: Thu Oct 31, 2013 5:36 am
Donator | Joined: Thu Nov 29, 2007 11:33 pm | Posts: 3899 | Location: Where do you want to go today? | Favourite OS: All Microsoft operating systems!
The only problem with this is that the downloaded files are larger, which has me concerned that it will take longer for users to download files once this system is put into place.

Still, it does appear to save storage space on the server side, so whatever is done about this, I hope that it works out well for the site.


Post subject: Re: Deduplication on the FTP | Posted: Thu Oct 31, 2013 8:42 am
Administrator | Joined: Tue Feb 12, 2008 5:28 pm | Posts: 7935
pizzaboy192 wrote:
So if I get this correctly, DeDuplication is basically pointing multiple file "handles" to the same location on the disc, thereby making it take up less space because there is only one actual copy of that block of data on the disc, but multiple references to it?

That's rather awesome (reminds me of how I'm understanding Python's variables to work, but that's just a random tie-in)

I wonder what would happen if I were to run that same sort of tool on my desktop's HDDs. I know that there's way too many duplicates of files on it already.
Yeah, except that it's not managed at the file level but at the block level. So even if two files are completely different sizes it can still deduplicate parts of them: if, for example, two ISOs contain the same setup.exe but the rest differs, then the setup.exe will still be deduplicated, without taking the ISOs apart.

And deduplicating your desktop HDD would work as long as it's not a boot volume and it runs under a supported OS. But remember, it won't gain anything on files that are already heavily compressed, such as video and audio.

WinPC wrote:
The only problem with this is that the downloaded files are larger, which has me concerned that it will take longer for users to download files once this system is put into place.

Still, it does appear to save storage space on the server side, so whatever is done about this, I hope that it works out well for the site.
Unfortunately the files do get somewhat larger (files that were heavily compressed before will get noticeably larger; otherwise it's a very marginal increase), so the files users download will take longer to grab. But this is unfortunately the only way to do it; the other option would be to not archive the files at all, as drive space is expensive and difficult to manage. So yes, files will be larger, but we'll also be able to store more of them :).

I am currently repacking the categories in the screenshot and will report back once it completes. But the screenshot I provided proves well enough that it works... I mean, look at it. We've got over 2.3TB of files on a drive that holds only 1TB. That sure is a lot better than having 1.7TB of files that don't fit on anything smaller than 1.7TB. I have an even better example which I will provide very soon, and you'll be quite amazed by the result...

Post subject: Re: Deduplication on the FTP | Posted: Thu Oct 31, 2013 12:08 pm
Donator | Joined: Sat Feb 24, 2007 4:14 pm | Posts: 6612 | Location: United Kingdom | Favourite OS: Server 2012 R2
As I advocated a long time ago, 2012's Deduplication is nothing short of miraculous.

Here's the last screenshot from when I had an NTFS Deduplicated volume:
[Screenshot: deduplication statistics for the volume]

Considering this volume was actually mostly (about 60%) HD video, with about 25% uncompressed ISOs and about 10% software in various formats, I was seriously impressed that it managed to get back 34% through dedup. Performance was generally much improved, actually. Deduplicating initially took a lot of resources - about 8-12GB of RAM (on a machine with 16GB) over a period of about 4 days - but read performance was better afterwards and writes improved too. One factor no one has mentioned here is that you're also more likely to hit the cache with deduplicated files, particularly on reads, which makes a huge difference.

Now I've moved to a mirrored storage space though, so I've changed file system from NTFS to ReFS, which currently does not support Dedup.

Post subject: Re: Deduplication on the FTP | Posted: Thu Oct 31, 2013 9:10 pm
Donator | Joined: Thu Nov 29, 2007 11:33 pm | Posts: 3899 | Location: Where do you want to go today? | Favourite OS: All Microsoft operating systems!
mrpijey wrote:
Unfortunately the files do get somewhat larger (files that were heavily compressed before will get noticeably larger; otherwise it's a very marginal increase), so the files users download will take longer to grab. But this is unfortunately the only way to do it; the other option would be to not archive the files at all, as drive space is expensive and difficult to manage. So yes, files will be larger, but we'll also be able to store more of them :).

I am currently repacking the categories in the screenshot and will report back once it completes. But the screenshot I provided proves well enough that it works... I mean, look at it. We've got over 2.3TB of files on a drive that holds only 1TB. That sure is a lot better than having 1.7TB of files that don't fit on anything smaller than 1.7TB. I have an even better example which I will provide very soon, and you'll be quite amazed by the result...
Would it be possible to run the server on a faster connection, so that downloads don't take as long for the users here? I understand that most likely isn't possible (especially with the money already required to maintain the FTP server itself), but I was just asking anyway, just to be sure.

Also, do you have a list of which files will take significantly longer to download?


Post subject: Re: Deduplication on the FTP | Posted: Thu Oct 31, 2013 9:23 pm
Administrator | Joined: Tue Feb 12, 2008 5:28 pm | Posts: 7935
The BA server is already fast - or are you really maxing out 100Mbit transfer rates? Remember, it doesn't only require the server to be fast, but also the client... It doesn't matter if the BA server can push 1000Gbit if you're on a slow 2Mbit connection...

The bandwidth on BA has never been an issue and we've rarely maxed it out (only during some very special beta releases), so bandwidth isn't a problem.

As for the larger files: no, it all depends on how well they compress. The ones that compress well will get larger (as they will be stored at their full uncompressed size); the ones that compress badly will remain roughly the same size, since compressed vs uncompressed doesn't differ much for them. But expect all files to become somewhat larger, as hardly any file gains 0% from the maximum compression setting. It's unfortunate, but it's either that or we stop hosting any more files, as we will soon run out of space if we continue as we do now.

Post subject: Re: Deduplication on the FTP | Posted: Thu Oct 31, 2013 10:15 pm
Staff | Joined: Thu Oct 23, 2008 3:25 am | Posts: 2688 | Location: Earth. | Favourite OS: Real Life
Another thing to consider is that if you're US-based, you've got to deal with the rather slow international interconnects. The BA server is still hosted in France (IIRC), so those of us who live "across the pond" will see reduced speed and increased latency compared to those who aren't located as far away.

Post subject: Re: Deduplication on the FTP | Posted: Fri Nov 01, 2013 8:39 am
Administrator | Joined: Fri Aug 18, 2006 11:47 am | Posts: 12564 | Location: Merseyside, United Kingdom | Favourite OS: Microsoft Windows 7 Ultimate x64
Unfortunately this is a law of physics we can't change. OVH is improving their international connectivity all the time so you should see improvements over time. It's certainly better than it was a year ago.

Post subject: Re: Deduplication on the FTP | Posted: Fri Nov 01, 2013 11:19 am
FTP Access | Joined: Tue Sep 21, 2010 12:47 pm | Posts: 240 | Favourite OS: IRIX 5.3 XFS 12/94
I did some diving into the file-system space a while ago while building my home NAS. I finally settled on ZFS, mainly because it always checksums everything and can fix problems automatically (if you have redundancy; otherwise it at least detects silent data corruption on disk) without manually running md5 or uncompressing archives. It also offers deduplication, but that blows up ZFS' already rather hungry RAM usage quite heavily. The rule of thumb for ZFS is 1GB of RAM for every 1TB of storage without dedup, although in my case it's actually more like 0.6GB/TB. Enabling dedup is said to require about 5GB of RAM per TB of storage, because it has to keep tables in memory to quickly find the duplicates.

How about the memory usage with dedup under Windows?
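
For anyone wondering where rules of thumb like that come from, here's a rough calculation. The ~320 bytes per dedup-table entry and the 64KiB average block size are commonly quoted ZFS ballpark figures rather than measured values, so treat the result as an order-of-magnitude estimate only.

Code:
# Back-of-the-envelope ZFS dedup-table RAM estimate.
# Assumes ~320 bytes of RAM per dedup-table entry (a commonly quoted ZFS figure)
# and one entry per stored block; both numbers vary with the actual pool.
def ddt_ram_gib(stored_tib: float, avg_block_kib: float = 64.0, bytes_per_entry: int = 320) -> float:
    blocks = stored_tib * 2**40 / (avg_block_kib * 2**10)
    return blocks * bytes_per_entry / 2**30

if __name__ == "__main__":
    for tib in (1, 2, 8):
        print(f"{tib} TiB stored at 64 KiB blocks -> ~{ddt_ram_gib(tib):.1f} GiB of dedup table")
    # 1 TiB at 64 KiB average blocks works out to roughly 5 GiB of table,
    # which is where the ~5GB RAM per TB rule of thumb comes from.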


Post subject: Re: Deduplication on the FTP | Posted: Thu Nov 07, 2013 6:01 am
Joined: Sat Sep 28, 2013 1:28 am | Posts: 75 | Location: Mexico
It's good that you're saving so much space but why aren't you using the vastly superior Linux to run the site?

Post subject: Re: Deduplication on the FTP | Posted: Thu Nov 07, 2013 6:41 am
FTP Access | Joined: Thu Jun 13, 2013 4:46 pm | Posts: 979
Holmes wrote:
It's good that you're saving so much space but why aren't you using the vastly superior Linux to run the site?

They use IIS as the website platform, which is Windows-only. There's also the fact that the FTP server software (Gene6, if I'm right) does not run on Linux.


Post subject: Re: Deduplication on the FTP | Posted: Thu Nov 07, 2013 8:22 am
Administrator | Joined: Tue Feb 12, 2008 5:28 pm | Posts: 7935
Holmes wrote:
It's good that you're saving so much space but why aren't you using the vastly superior Linux to run the site?

Vastly superior? That may be your opinion of course, but not everyone else's. The entire server platform is based around Windows because it's easier for us to set up, administer and diagnose, and the software and features we need to make everything work would be a lot more complicated to set up on the Linux platform. Windows Server does the hosting job very well and we have no reason to change it and make things more complicated than they already are.

Post subject: Re: Deduplication on the FTP | Posted: Fri Nov 08, 2013 12:15 am
FTP Access | Joined: Tue Sep 21, 2010 12:47 pm | Posts: 240 | Favourite OS: IRIX 5.3 XFS 12/94
mrpijey wrote:
Vastly superior? That may be your opinion of course, but not everyone else's.

At least in the file systems space, other platforms offer better solutions than good old NTFS. Administration and software stack are highly dependent on personal (dis)like, of course.


Post subject: Re: Deduplication on the FTP | Posted: Fri Nov 08, 2013 4:18 pm
Donator | Joined: Sat Aug 19, 2006 1:25 am | Posts: 590 | Location: Israel
If possible, you should try to make the files inside the RAR archives align with the dedup sector size to maximize dedup gains.

Also, deduplication without sufficient RAM makes I/O operations slow to a crawl during writes (that's true for every deduplication system), so make sure you have enough, and keep the original compressed RARs in case you change your mind.


Post subject: Re: Deduplication on the FTP | Posted: Fri Nov 08, 2013 4:20 pm
Administrator | Joined: Fri Aug 18, 2006 11:47 am | Posts: 12564 | Location: Merseyside, United Kingdom | Favourite OS: Microsoft Windows 7 Ultimate x64
The server we're going to be using has at least 8GB of RAM and is completely dedicated to the task. I'm more concerned about CPU usage personally as I haven't seen any real use data from mrpijey but we'll figure something out if it's an issue at all.

Post subject: Re: Deduplication on the FTP | Posted: Sat Nov 09, 2013 9:01 pm
FTP Access | Joined: Tue Sep 21, 2010 12:47 pm | Posts: 240 | Favourite OS: IRIX 5.3 XFS 12/94
Andy wrote:
The server we're going to be using has at least 8GB of RAM and is completely dedicated to the task. I'm more concerned about CPU usage personally as I haven't seen any real use data from mrpijey but we'll figure something out if it's an issue at all.

Usually dedup works on a gigantic hash table, so as long as the whole table fits into RAM, those lookups should be near-instantaneous. In the case of ZFS, 8GB of RAM would last for deduplicating at most 2TB worth of actually stored data, probably less. An alternative is swapping the dedup table out to an SSD.
But even if the impact were noticeable, it's only relevant for writes. I guess the FTP here is mostly reads, which perform the same whether deduplication is in place or not.


Post subject: Re: Deduplication on the FTP | Posted: Sun Nov 10, 2013 2:36 am
Administrator | Joined: Tue Feb 12, 2008 5:28 pm | Posts: 7935
ZFS is vastly inferior when it comes to deduplication due to its insane memory usage. If we had used ZFS our server would have needed some 60-80GB of RAM, which is completely ridiculous, although I suspect ZFS can make do with a lot less than that. The Windows deduplication service certainly does: the memory usage (so far) doesn't exceed 3GB for deduplicating over 8TB of data (and that includes deduplicating other stuff than just BA). I've compared ZFS deduplication to NTFS deduplication and (so far) the results have been the same, but with vastly higher memory requirements for ZFS. There's no real magic to deduplication, only how it's managed, processed and stored by the file system in use. ZFS has other advantages of course, as it is a superior file system to NTFS, but nothing BetaArchive has any need for.

Also, "dedup sector size"? Dedup isn't a file system, there is no "sector size", nor can I affect the contents of the RAR files except add or remove files (which is of course not wanted). To maximize deduplication efficiency I simply have to make sure the data stream is unaltered when all the files are put together into a rar file, something you achieve with using zero compression (i.e "store"). And the BA server is by 99% dependent on reading operations, writing is only done during additions or repacks, so it's not a big issue. And deduplication operations can be moved to any time of the day so we can schedule it during the times with the lowest activity on BA. Or we can simply turn off the FTP during deduplication but that is not needed as the server can handle the load just fine.

As for Linux's advantages over Windows: those advantages become disadvantages if you spend more time configuring and managing the same services and features on one OS than you would on the other, given that the end result is the same. As we don't have any performance bottlenecks at BA, there are no gains to be had by using a different operating system. As the saying goes, time is money, and if you spend more to get the same (or less) then it's not worth it. BetaArchive will never use Linux as a host system because it is simply more trouble to get it working properly than it's worth, and the gains from such a change would be very small, if any. No point arguing about it, however.

Post subject: Re: Deduplication on the FTP | Posted: Wed Nov 13, 2013 11:11 pm
Donator | Joined: Sat Aug 19, 2006 1:25 am | Posts: 590 | Location: Israel
mrpijey wrote:
Also, "dedup sector size"? Dedup isn't a file system, there is no "sector size"

I meant chunk size, my mistake.

Deduplicating file systems split your data into chunks and hash them. Whenever a new chunk needs to be written it is compared against the existing chunks in the deduplication table. If you align the files inside the uncompressed archives to the chunk size (32-128 KB in this case) you increase the chance of dedup hits and thus increase your dedup savings.

Also, NTFS deduplication is less memory-intensive than ZFS because it dedups asynchronously, while ZFS dedups during writes (so the entire dedup table has to fit in RAM at all times).

EDIT: I looked a bit into controlling file alignment inside RAR archives, and there seems to be no easy way to do it.
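
For illustration, the alignment idea would conceptually look like the sketch below: pad every member file out to the next chunk boundary so identical files always start chunk-aligned inside the uncompressed container. The chunk size is just an assumed value, and as noted above there's no easy way to actually control this inside a RAR archive.

Code:
# Conceptual illustration of chunk alignment: pad every file to a multiple of the
# dedup chunk size so identical files land on identical chunk boundaries.
CHUNK = 64 * 1024  # assumed chunk size; Windows dedup uses variable ~32-128 KB chunks

def padded_layout(file_sizes: list) -> list:
    """Return (offset, padded_size) for each file when aligned to CHUNK boundaries."""
    layout = []
    offset = 0
    for size in file_sizes:
        padded = (size + CHUNK - 1) // CHUNK * CHUNK   # round up to the next chunk boundary
        layout.append((offset, padded))
        offset += padded
    return layout

if __name__ == "__main__":
    sizes = [350_000, 120_000, 65_536]
    for (offset, padded), size in zip(padded_layout(sizes), sizes):
        print(f"file of {size:>7} bytes -> chunk-aligned offset {offset}, "
              f"padded to {padded} ({padded - size} bytes of padding)")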

