Header not repeatable for the same compressed data

description

Hi,
 
I am using SevenZipSharp to compress a large amount of data and then compare the results.
For the comparison, I compute a hash (SHA1) of the compressed output.
My problem is that if I compress the same data twice, the two compressed binary streams are not identical.
Two fields in the header change on every run, even though the data inside is exactly the same.
Hence my hash is always different, which makes it useless for comparison.
 
According to 7zformat.txt:
 
SignatureHeader
~~~~~~~~~~~~~~~
BYTE kSignature[6] = {'7', 'z', 0xBC, 0xAF, 0x27, 0x1C};
 
ArchiveVersion
{
BYTE Major;   // now = 0
BYTE Minor;   // now = 2
};
 
UINT32 StartHeaderCRC;
 
StartHeader
{
REAL_UINT64 NextHeaderOffset
REAL_UINT64 NextHeaderSize
UINT32 NextHeaderCRC
}
 
 
I have identified the problematic fields as StartHeaderCRC and NextHeaderCRC.
I guess StartHeaderCRC is different because NextHeaderCRC is also different.
Now the question is: why is NextHeaderCRC different? The rest of the binary stream is strictly identical!
 
Below is a side-by-side comparison of the beginning of the two binary streams. I have starred the lines that differ.
[0]: 55      [0]: 55
[1]: 122     [1]: 122
[2]: 188     [2]: 188
[3]: 175     [3]: 175
[4]: 39      [4]: 39
[5]: 28      [5]: 28
[6]: 0       [6]: 0
[7]: 3       [7]: 3
*[8]: 219    [8]: 138
*[9]: 14     [9]: 74
*[10]: 230   [10]: 127
*[11]: 232   [11]: 66
[12]: 184    [12]: 184
[13]: 0      [13]: 0
[14]: 0      [14]: 0
[15]: 0      [15]: 0
[16]: 0      [16]: 0
[17]: 0      [17]: 0
[18]: 0      [18]: 0
[19]: 0      [19]: 0
[20]: 80     [20]: 80
[21]: 0      [21]: 0
[22]: 0      [22]: 0
[23]: 0      [23]: 0
[24]: 0      [24]: 0
[25]: 0      [25]: 0
[26]: 0      [26]: 0
[27]: 0      [27]: 0
*[28]: 167   [28]: 92
*[29]: 137   [29]: 92
*[30]: 29    [30]: 189
*[31]: 223   [31]: 50
[32]: 0      [32]: 0
[33]: 119    [33]: 119
[34]: 174    [34]: 174
[35]: 211    [35]: 211
[36]: 227    [36]: 227
[37]: 163    [37]: 163
[38]: 249    [38]: 249
[39]: 129    [39]: 129
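
To make the dump easier to read, here is a minimal sketch that decodes the first 32 bytes of the left-hand stream using the layout quoted from 7zformat.txt. The class name and the Crc32 helper are made up for illustration and are not part of SevenZipSharp. It assumes a little-endian machine, and it assumes (from my understanding of the format, not from the excerpt above) that StartHeaderCRC is a standard CRC-32 over the 20 bytes of StartHeader, which would explain why it changes whenever NextHeaderCRC changes.

```csharp
using System;

static class SevenZipHeaderDump
{
    // Bitwise CRC-32 (reflected polynomial 0xEDB88320), the common zlib/PKZIP variant.
    // Assumption: this is the CRC variant used for the 7z header fields.
    static uint Crc32(byte[] data, int offset, int count)
    {
        uint crc = 0xFFFFFFFFu;
        for (int i = offset; i < offset + count; i++)
        {
            crc ^= data[i];
            for (int bit = 0; bit < 8; bit++)
            {
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
            }
        }
        return ~crc;
    }

    static void Main()
    {
        // Bytes [0]..[31] of the left-hand stream in the comparison above.
        byte[] header =
        {
            55, 122, 188, 175, 39, 28,   // kSignature: '7', 'z', 0xBC, 0xAF, 0x27, 0x1C
            0, 3,                        // ArchiveVersion: Major, Minor
            219, 14, 230, 232,           // StartHeaderCRC (starred in the dump)
            184, 0, 0, 0, 0, 0, 0, 0,    // NextHeaderOffset
            80, 0, 0, 0, 0, 0, 0, 0,     // NextHeaderSize
            167, 137, 29, 223            // NextHeaderCRC (starred in the dump)
        };

        // BitConverter uses the host byte order; the fields are stored little-endian.
        uint startHeaderCrc = BitConverter.ToUInt32(header, 8);
        ulong nextHeaderOffset = BitConverter.ToUInt64(header, 12);
        ulong nextHeaderSize = BitConverter.ToUInt64(header, 20);
        uint nextHeaderCrc = BitConverter.ToUInt32(header, 28);

        Console.WriteLine("Version:          {0}.{1}", header[6], header[7]);
        Console.WriteLine("StartHeaderCRC:   0x{0:X8}", startHeaderCrc);
        Console.WriteLine("NextHeaderOffset: {0}", nextHeaderOffset);
        Console.WriteLine("NextHeaderSize:   {0}", nextHeaderSize);
        Console.WriteLine("NextHeaderCRC:    0x{0:X8}", nextHeaderCrc);

        // Compare this value with the stored StartHeaderCRC printed above.
        Console.WriteLine("CRC-32 of bytes 12..31: 0x{0:X8}", Crc32(header, 12, 20));
    }
}
```

On a little-endian machine the offset decodes to 184 and the size to 80 for both streams, so the starred bytes line up exactly with the two CRC fields.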
 
Thanks for your help.
Yo.
 
 
 
 
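My code is:
 
// Requires: using System.IO; using SevenZip; (the SevenZipSharp namespace)
byte[] result;
using (MemoryStream inputStream = new MemoryStream())
using (MemoryStream outputStream = new MemoryStream())
{
    // Serialize the XML document into the in-memory input stream.
    if (xmlDocument != null)
    {
        xmlDocument.Save(inputStream);
    }

    // Compress the raw stream into a 7z archive held in memory.
    SevenZipCompressor compressor = new SevenZipCompressor();
    compressor.ArchiveFormat = OutArchiveFormat.SevenZip;
    compressor.CompressStream(inputStream, outputStream);

    // These are the bytes that get hashed and compared.
    result = outputStream.ToArray();
}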

comments

yoyovicks wrote May 4, 2011 at 5:24 PM

I finally found the origin of the problem.

In addition to the CRC discrepancy mentioned above, I also found a difference between the streams toward the end.
This is because the file time is recorded there, which is why the CRCs differ.
The date/time differs because SevenZipSharp (SZS) forces the date to the current time when compressing a stream.
Essentially, SZS can compress a raw binary stream, but it has to make it look like a file. So the entry gets a default name, "default", and a last write time of DateTime.Now.ToFileTime().

If I patch SZS to return an arbitrary fixed date for ItemPropId.LastWriteTime, the compressed streams are identical, which is exactly what I want.
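
To illustrate why the fixed date helps, here is a toy sketch. It does not use SZS at all: the "archive" is simply the payload followed by an 8-byte file time, and FixedLastWriteTime and HashWithTimestamp are hypothetical names chosen for this example. The point is only that a hash over bytes containing DateTime.Now changes from run to run, while a constant timestamp keeps it stable.

```csharp
using System;
using System.Security.Cryptography;

static class FixedTimestampSketch
{
    // Hypothetical constant; any fixed instant works as long as it never changes.
    static readonly DateTime FixedLastWriteTime =
        new DateTime(2000, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Toy stand-in for an archive: payload bytes followed by an 8-byte FILETIME.
    static string HashWithTimestamp(byte[] payload, long fileTime)
    {
        byte[] stamp = BitConverter.GetBytes(fileTime);
        byte[] buffer = new byte[payload.Length + stamp.Length];
        Buffer.BlockCopy(payload, 0, buffer, 0, payload.Length);
        Buffer.BlockCopy(stamp, 0, buffer, payload.Length, stamp.Length);

        using (SHA1 sha1 = SHA1.Create())
        {
            return BitConverter.ToString(sha1.ComputeHash(buffer));
        }
    }

    static void Main()
    {
        byte[] payload = { 1, 2, 3, 4 };

        // Fixed timestamp: both hashes are computed over identical bytes.
        long fixedTime = FixedLastWriteTime.ToFileTimeUtc();
        Console.WriteLine(HashWithTimestamp(payload, fixedTime) ==
                          HashWithTimestamp(payload, fixedTime)); // True

        // Current time: the value differs between runs, so the hash does too.
        Console.WriteLine(HashWithTimestamp(payload, DateTime.Now.ToFileTime()));
    }
}
```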

Would it be possible for a future version of SZS to include an option to fix the date when compressing a raw stream?

Thanks very much.
Yo.

wrote Feb 22, 2013 at 1:16 AM