Post subject: Re: Deduplication on the FTP    Posted: Thu Nov 14, 2013 12:36 am
Administrator | Joined: Tue Feb 12, 2008 5:28 pm | Posts: 7935
In the end I don't think the effort of chunk alignment would bring any major improvement, since the chunks are so small anyway; it would make a lot more sense with larger ones (e.g. 512KB and up). I've looked into the chunk size (the "deduplication granularity", as it's called) but there doesn't seem to be any way to change it in Windows Server. The only thing we can do at this point is to keep the archives uncompressed. RAR and other archivers aren't built to align data to specific boundaries within their structures, so it's a moot point anyway, unless we devised our own way of zero-compressing the files. That wouldn't be good either, since it would be a highly proprietary format just for BA, which again is bad.

I've already gotten impressive results with the BA archive (I am still in the process of repacking it all), so I think we'll go with this for now. The important thing is that we don't keep the contents of each release split across separate files and that we don't compress them. Larger files, yes; longer downloads, yes; but in the end I have to prioritize server space over end-user comfort. And as contradictory as it sounds, by making the files larger I've managed to reduce the storage actually used on the drive, which is what I wanted to accomplish in the end.

Once the repack is done I will make a summary and present it here on BA.

Post subject: Re: Deduplication on the FTP    Posted: Sun Nov 17, 2013 9:57 am
Administrator | Joined: Tue Feb 12, 2008 5:28 pm | Posts: 7935
Well, I have just about finished the repack... the results:

Code:
Archive:      | Current: | Repack:  |
-------------------------------------
Size (files): | 5.26TB   | 7.07TB   |
Size (disk):  | 5.26TB   | 3.50TB   |
-------------------------------------

Size (files) is the actual file size,
Size (disk) is the amount of space the files occupy on the drive.
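
(If you want to check those two numbers yourself on a dedup-enabled volume, here's a small Windows-only Python sketch; the path is just an example. GetCompressedFileSize returns the allocated size, which is roughly what Explorer shows as "Size on disk".)

Code:
import ctypes
import os
from ctypes import wintypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.GetCompressedFileSizeW.argtypes = [wintypes.LPCWSTR, ctypes.POINTER(wintypes.DWORD)]
kernel32.GetCompressedFileSizeW.restype = wintypes.DWORD

def size_on_disk(path):
    """Allocated size of a file; deduplicated, sparse or NTFS-compressed
    files report less than their logical size."""
    high = wintypes.DWORD(0)
    low = kernel32.GetCompressedFileSizeW(path, ctypes.byref(high))
    if low == 0xFFFFFFFF and ctypes.get_last_error() != 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return (high.value << 32) | low

path = r"E:\Releases\example.rar"                  # example path only
print("Size (files):", os.path.getsize(path))      # logical size
print("Size (disk): ", size_on_disk(path))         # allocated size on the volume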

Now, before you choke on your pretzel, let me explain why the archive has grown so much and why it was necessary:

To fully accommodate the deduplication feature of Windows Server 2012 R2, the server has to index all data chunks and optimize every duplicate chunk to maximize efficiency. But when you use maximum compression (or any level of compression, for that matter) you're basically creating unique chunks that are impossible to optimize. The whole idea behind compression is to replace every duplicate occurrence of a data chunk within a file with an index, which is exactly how deduplication works, except that deduplication works at the volume level rather than the file level. So to improve this "compression" it has to be optimized across the whole volume, which is exactly what deduplication does. But to make sure the chunks match across the whole volume we can't use regular compression. Instead I had to unpack all the files and keep them in their original, uncompressed state.
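
To make the chunk idea a bit more concrete, here is a rough Python sketch of what a dedup engine does conceptually. It is only an illustration: the real engine uses variable-size chunks (roughly 32-128KB) and its own hashing, not fixed 64KB blocks.

Code:
import hashlib

CHUNK_SIZE = 64 * 1024  # fixed size for illustration only

def chunk_hashes(path):
    """Hash every chunk of a file; duplicate hashes are dedup candidates."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield hashlib.sha256(chunk).hexdigest()

def shared_chunks(path_a, path_b):
    """Rough proxy for how much space dedup could save between two files."""
    return len(set(chunk_hashes(path_a)) & set(chunk_hashes(path_b)))

# Two uncompressed copies of the same ISO share nearly every chunk, so the
# second copy costs almost nothing on disk. Compress each copy with maximum
# compression and the two byte streams become unique, so the chunk sets
# barely overlap and dedup can't optimize anything.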

This, however, created a new problem: file count. Some releases are just a single ISO file; others can be thousands of files. That's a nightmare when it comes to making sure no files are damaged, and I would say nearly impossible without exotic file systems or other means (per-folder parity checking etc.). So my idea was to simply combine the best of the old world and the new: use RAR as the container format, but without compression. That way I can add recovery data to the archive, gather each release into a single file, and still include all the bits and pieces I want with each release.
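
In practice that boils down to store-only RAR archives with a recovery record, something along these lines (a sketch, not the exact commands used for the repack; -m0 means "store, no compression" and -rr adds the recovery record):

Code:
import subprocess

def pack_release(src_dir, archive):
    """Pack a release folder into one uncompressed RAR with recovery data."""
    subprocess.run(["rar", "a", "-m0", "-rr", archive, src_dir], check=True)

pack_release("Windows_XP_build_2505", "Windows_XP_build_2505.rar")  # example names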

So what exactly happened to make the archive grow? Simple: files that compressed well and shrank in the old archive have grown back to their original, uncompressed size, while files that didn't compress well changed very little (in some cases they even got slightly smaller, since the compression overhead is gone). Many releases therefore became a lot larger when repacked. This unfortunately means that you will have to download more to get the same files, but at this point it's a necessary evil: drive space is expensive, and so are servers that hold a lot of drives. To store more we have to reduce the amount of data actually written to the drives. As you can see in the summary at the top, the archive grew by 1.8TB, which is quite a lot, but at the same time the data written to disk shrank by 1.75TB. That means we freed up 1.75TB of space for future releases, and the savings will keep growing, since every addition to the FTP will be processed by deduplication and optimized. So by repacking we managed to save space, even if it means longer downloads for you...

The archive will grow a lot in the near future with the upcoming Windows XP, Server 2003 and other releases. Just to give you an idea of the savings we're about to make: the upcoming Windows XP/Server 2003 releases alone take up almost 600GB. With the current method (maximum compression) that would mean roughly 600GB of extra data on the FTP. Deduplicated and repacked without compression, the same releases will take up no more than about 30GB on disk. That's a saving of 570GB from that batch alone, and it only gets crazier as we continue with other releases in the future... As a side note, I managed to squeeze over 12TB of ISO files onto a single 1TB hard drive just by not compressing them. The power of deduplication!
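
A quick sanity check of the numbers above, for anyone who wants to follow along:

Code:
# Figures quoted in this post (TB / GB)
old_files, new_files = 5.26, 7.07       # "Size (files)"
old_disk,  new_disk  = 5.26, 3.50       # "Size (disk)"

print(round(new_files - old_files, 2))  # 1.81 -> archive grew by ~1.8TB
print(round(old_disk - new_disk, 2))    # 1.76 -> ~1.75TB less written to disk

xp_files_gb, xp_disk_gb = 600, 30
print(xp_files_gb - xp_disk_gb)         # 570  -> GB saved on the XP/2003 batch alone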

I hope you all now understand why this had to be done.

---

The new FTP layout is not yet online. Along with this deduplication effort I have also taken the opportunity to upgrade the server hardware (the FTP server, not the main BA server) and its software, so a few things still need to fall into place before it's operational. Andy and I will work on switching the servers over in the near future; I just have to make sure everything works so there won't be any interruption of the FTP service. We will make an announcement when the new server is online and ready to go. The only immediate difference you will see is a larger archive size reported by the site, as it will report the status of the new repacked FTP rather than the old one. So don't be alarmed: I have not added any new releases yet, and I've even removed a few duplicates I found during the repacking process. As soon as everything is online and ready I will continue processing new releases.

Finally, here are a few dedup stats for those of you who know how to read them :) :

Code:
Volume      SavedSpace      SavingsRate
------      ----------      -----------
E:          1.82 TB         51 %
F:          1.52 TB         45 %

E: and F: are the two 2TB drives the entire FTP resides on.

Questions? Let's hear them... :).

Post subject: Re: Deduplication on the FTP    Posted: Tue Mar 18, 2014 10:07 pm
FTP Access | Joined: Sun Mar 16, 2014 6:56 am | Posts: 153 | Favourite OS: DOS
Nice, I wouldn't mind using this at home except I don't have enough RAM! It would probably be cheaper for me to just get more disk :)

ppc_digger wrote:
Deduplicating file systems split your data into chunks and hash them. Whenever a new chunk needs to be written it is compared to the other chunks in the deduplication table. If you align the files inside the uncompressed archives to match the chunk size (32-128 KB in this case) you increase the chance for dedup hits and thus increase your dedup savings.

Some types of deduplication technology don't keep all of their chunks at fixed alignments; they are smart enough to guess where the boundaries should be in order to be more likely to match an existing chunk. For example, maybe they understand RAR and ISO directory listings, so they can start a new chunk wherever a new file starts in the archive. I don't know whether Microsoft's implementation does this.
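
As a toy illustration of that idea (content-defined chunking, where the cut points follow the data instead of fixed offsets), something like the Python below. The constants and the hash are made up and certainly not what Microsoft uses; the point is only that because a boundary depends on the last few bytes of data, the same content produces the same cut points even when it is shifted around inside an archive.

Code:
import hashlib

WINDOW = 48                     # boundary decision looks only at the last 48 bytes
MIN_CHUNK = 32 * 1024
MAX_CHUNK = 128 * 1024
BOUNDARY_MASK = 64 * 1024 - 1   # aim for chunks roughly in the 32-128 KB range

def is_boundary(window):
    digest = hashlib.blake2b(window, digest_size=8).digest()
    return (int.from_bytes(digest, "big") & BOUNDARY_MASK) == 0

def chunks(data):
    """Split data into variable-size, content-defined chunks."""
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        if size >= MAX_CHUNK or (size >= MIN_CHUNK and is_boundary(data[i + 1 - WINDOW:i + 1])):
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]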

mrpijey wrote:
This unfortunately means that you will have to download a lot more to get the same amount of files.

I know some web servers perform runtime compression for clients that advertise support for it via the "Accept-Encoding" header. Perhaps something like that could make up for the lack of compression? I've always seen gzip mentioned, which I guess is nowhere near as good as RAR's compression, but it might be better than nothing. Of course, it would consume some CPU on the server.


Post subject: Re: Deduplication on the FTP    Posted: Fri Mar 21, 2014 12:37 pm
FTP Access | Joined: Thu Feb 20, 2014 8:22 pm | Posts: 160 | Location: Germany | Favourite OS: Mac OS 10.3
Runtime compression is a feature of HTTP servers. Server and browser negotiate whether the client supports the feature, the server sends the requested files compressed, and the browser decompresses them; it is completely transparent. It works well because text files are small and easy to compress, the effort is minimal (especially since the compressed output is of course cached) and the results are significant. And gzip supports stream compression, which means the file doesn't need to be randomly accessible; it can be compressed while in transit.

This, however, is an FTP server, and there is no standard compression feature in the FTP protocol. It *is* possible, but the FTP server would have to know what to do: it would need to be aware that its files are stored uncompressed and present a faked directory listing of already-compressed files. Whenever a client requested a file, it could stream-compress it (I don't think the RAR format supports that) and send it to the client, which would receive a finished archive. But you couldn't easily cache the many gigabytes involved, so the server would have to do the work over and over.
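
Just to show what stream compression looks like in practice, a tiny Python sketch using gzip as mentioned above (file names made up). The point is that the data is compressed as it flows past, with no random access to the source and no temporary copy, which is what you would need to compress a download on the fly:

Code:
import gzip
import shutil

def stream_gzip(src_path, dst_fileobj):
    """Read the stored archive in chunks and write gzip output to any
    writable stream (a file, a socket file object, ...)."""
    with open(src_path, "rb") as src, gzip.GzipFile(fileobj=dst_fileobj, mode="wb") as gz:
        shutil.copyfileobj(src, gz, 64 * 1024)

with open("release.rar.gz", "wb") as out:
    stream_gzip("release.rar", out)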

It would also be different in that it's not transparent transport compression but a pre-packaged download. Which is fine, if you want to keep the files compressed for storage anyway.

Post subject: Re: Deduplication on the FTP    Posted: Fri Mar 21, 2014 12:38 pm
Administrator | Joined: Fri Aug 18, 2006 11:47 am | Posts: 12564 | Location: Merseyside, United Kingdom | Favourite OS: Microsoft Windows 7 Ultimate x64
On-the-fly compression is also very CPU-intensive, so it will never be considered for large files.

Post subject: Re: Deduplication on the FTP    Posted: Wed Jul 16, 2014 5:21 pm
Donator | Joined: Mon Jun 30, 2014 8:41 am | Posts: 54 | Location: Shanghai, China
mrpijey, I believe it would be better to keep an archive somewhere for removed files.
I was looking through the FTP logs and found something rather interesting: some of the Office 14 betas have been removed... I don't know why.


Post subject: Re: Deduplication on the FTP    Posted: Thu Jul 17, 2014 9:29 am
Administrator | Joined: Tue Feb 12, 2008 5:28 pm | Posts: 7935
What does this have to do with deduplication? Please stay on topic... If you have a report or question, post it in the appropriate forum or PM me directly.

Post subject: Re: Deduplication on the FTP    Posted: Thu Jan 29, 2015 7:01 pm
FTP Access | Joined: Fri Jul 01, 2011 3:04 am | Posts: 352
@mrpijey: What are the deduplication numbers now that the FTP has surpassed 14TB?

