Operating System - OpenVMS

Re: Compare files with checksum and differences

 
Shriniketan Bhagwat
Trusted Contributor

Re: Compare files with checksum and differences

Hi Tina,

MD5 is a more secure digest (hash) algorithm than the default CHECKSUM XOR. I would use the MD5 algorithm instead of the default XOR.

http://www.faqs.org/rfcs/rfc1321.html
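As a rough illustration of why a real digest is preferable, here is a sketch in Python (hashlib is the standard-library module; the byte strings are invented example data, not anything from the customer's files). A one-character change in the input produces a completely different 128-bit MD5 digest:

```python
import hashlib

# Invented example data; any byte strings work.
original = b"trade record 0001\n"
tampered = b"trade record 0002\n"

d1 = hashlib.md5(original).hexdigest()
d2 = hashlib.md5(tampered).hexdigest()

print(d1)        # 32 hex digits (a 128-bit digest)
print(d1 != d2)  # a one-character change yields a different digest
```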


Regards,
Ketan
Hoff
Honored Contributor

Re: Compare files with checksum and differences

Ok, I'm addressing a few different issues here: the differences themselves (for which there isn't enough detail for a determination yet), whether this is an integrity check, and whether there are any requirements for cryptographic security.

XOR can easily report duplicates, and it's trivial to create cases where it will return dups; it's also trivial to adjust the XOR value to match whatever you want.

MD5 is better than XOR at catching "innocent" and accidental changes. It's a decent integrity check.

The MD5 digest is not cryptographically secure, having been broken some years ago. This is security as differentiated from integrity.

http://www.kb.cert.org/vuls/id/836068

As for (better, more secure) alternatives, OpenVMS has an OpenSSL port installed as part of the CDSA component, and that offers the SHA1 digest.

OpenSSL is expressly intended to create and later verify cryptographic signatures, and to perform related security tasks.

(The VMS OpenSSL documentation is a tad cryptic, no pun intended. But I digress.)

The command to create a digest (on Unix) is:

openssl dgst -sha1 {file}

There's probably also a sha1sum port around for VMS, and that and OpenSSL are intended to produce identical results.
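The same digest can be confirmed from Python's standard library; hashlib.sha1 over the same bytes matches the hex output of `openssl dgst -sha1` and sha1sum. The "abc" input below is the well-known FIPS 180 test vector:

```python
import hashlib

# SHA-1 of the same bytes matches `openssl dgst -sha1` and sha1sum output.
digest = hashlib.sha1(b"abc").hexdigest()
print(digest)  # → a9993e364706816aba3e25717850c26c9cd0d89d (FIPS 180 test vector)
```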

Getting back to the underlying question from the stock exchange, the checksum and MD5 stuff being discussed here is akin to the "check engine" light on the dashboard of a car. It tells the driver that there's something out of spec or wrong, but not what.

As for the details of this case and of the files involved, a hexadecimal file dump and a careful look at the byte-level differences will be warranted. This could well be a file transfer error, for instance. But without seeing what the file construction and the data and the history of the data (eg: network ftp transfers, etc) might be, the immediate trigger for the differences isn't clear.

That there's a good reason here for the digests to be different is abundantly clear.
Steven Schweda
Honored Contributor

Re: Compare files with checksum and differences

> The customer intend to know if that two
> files are same or different. They used
> checksum, it said the files are same.

Read carefully:

> [...] (If the checksums differ, then
> the files differ. If the checksums match,
> then the files are unlikely to differ, where
> the meaning of "unlikely" depends on the data
> in the files, and on the checksum algorithm
> used).

Is some part of that unclear?

Matching checksums is NOT proof that the
files are the same.

> How many different 32-bit checksums are
> possible? How many different 33-bit files
> are possible?

Think about it.
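The pigeonhole arithmetic behind that question, spelled out:

```python
# A 32-bit checksum can take 2**32 distinct values, but there are
# 2**33 distinct 33-bit files -- twice as many files as checksums.
# By the pigeonhole principle, collisions are unavoidable.
checksums = 2 ** 32
files = 2 ** 33
print(files // checksums)  # → 2: on average, two 33-bit files per checksum value
```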
Hein van den Heuvel
Honored Contributor

Re: Compare files with checksum and differences

Could you tell me:

1.what kind of algorithm we should be used for RMS files with recovery unit journaling?

DIFFERENCE ... if the file is used with record access and not raw bytes.

But what is really needed, and why do you mention that they are marked for RU? Why do you think that is relevant (it may be).
Is the need for a checksum or for a difference indicator?

This is a highly confusing question to me.

Is the intent to compare data files which happen to be marked for RU journaling and have a name to reflect that, or are these the actual RMS RUJ files used to store RU data? RMS RUJ files are transient and non-transportable, and I see little point in looking at them at all (except maybe in a QA setting).

What is in the files themselves, and how is the data being used?

If it is used in RECORD mode, as DIFF uses, then the tool should probably use RECORD access and NOT look at all the bits in the blocks (CONVERT, area pre-allocation in indexed files, padding bytes for odd-sized records in sequential files).


>> 2. What is the main different with the command "differences" and "checksum"?

DIFFERENCES needs the original and will do a byte-by-byte comparison within the normal data record bytes, ignoring metadata (if any).
It is the most reliable comparison, and your weapon of choice ... IF you have the original handy.

If you need to compare all bytes, then use BACKUP/COMPARE, or temporarily change the two files' attributes to SEQ/FIX/512.

Checksum just creates a lucky number and depending on the data contents that may be an unlucky number.

>> 3. Dose the "differences" just compare with the text file?

Binary or text.

>> 4. When we intend to compare two files what is the rule we should to follow to use to compare files?

The main rule is to actually know what you are trying to accomplish.
Verify a file transfer? (Copy, NFS, FTP, backup)
Verify a processing run output?
Verify a transform and back, such as ZIP + UNZIP or BACKUP into a SAVESET and BACK?

From the questions you posted so far, I'd recommend DIFF and if you needed just a single lucky number then make it be CHECKSUM/ALGO=CRC.
The latter may give false differences, much like BACKUP/COMPARE, but that's probably better than the false equal that you appear to have suffered with CHECKSUM/FILE.
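To illustrate why CRC is the better "lucky number" here, a Python sketch using the standard-library zlib.crc32 (the file contents are invented; this models the behavior, not the exact DCL CHECKSUM/ALGORITHM=CRC implementation):

```python
import zlib

# Unlike XOR, CRC-32 is sensitive to byte position, so reordered data
# (almost always) yields a different value.  Example bytes are made up.
a = zlib.crc32(b"1\n2\n") & 0xFFFFFFFF
b = zlib.crc32(b"2\n1\n") & 0xFFFFFFFF
print(a != b)  # → True: same bytes, different order, different CRC
```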

The data must have had some XOR symmetry in there to make the XOR checksum generate the same value every time ... or an operational error slipped by.

Hope this helps some,
Hein van den Heuvel
HvdH Performance Consulting.
John Gillings
Honored Contributor

Re: Compare files with checksum and differences

Tina,

Consider, in simple terms, the XOR checksum sees the file as a stream of longwords. You just XOR them together, getting a 32 bit result. A lot of the time, any simple change in the file will result in a different checksum, but as Steven has noted a few times, the mathematics mean that there must be a very large number of possible files which generate the same checksum.

In particular, the XOR takes no account of the position of a record in a file. Therefore, take the XOR checksum of any file of any size with records in random order. Now reorder the file any way you like. As long as you don't add, remove or change any records, the checksum will be the same. Same data, different sequence, same checksum. I suspect this may be the case with your files (Try DIFF/PARALLEL - it may make things clearer)
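That position-blindness can be shown in a few lines of Python (a sketch of an XOR longword checksum, not the exact VMS CHECKSUM code; the longword values are invented):

```python
def xor_checksum(longwords):
    # XOR a stream of 32-bit longwords into one 32-bit result.
    acc = 0
    for w in longwords:
        acc ^= w
    return acc

records = [0x11111111, 0x22222222, 0x33333333]
shuffled = [0x33333333, 0x11111111, 0x22222222]  # same longwords, reordered
print(xor_checksum(records) == xor_checksum(shuffled))  # → True: XOR is order-blind
```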

Other checksum algorithms may calculate different values for reordered files, but the mathematics says there MUST be multiple files which generate the same checksum.

Since DIFF and CHECKSUM (of any algorithm) both read the whole contents of both files, there's no performance benefit to using CHECKSUM, and, as you've noticed, you get false negatives. If you want to know that two files are identical, use DIFF. Indeed, I'd suggest you write a simple program which avoids all the clever resynch logic of DIFF: just read the two files byte by byte until you reach EOF or find a difference. If all you want to know is "same/different", you can stop reading at the first difference. I'd also do a preliminary check on file size; obviously, files that are different sizes are different!

(for the pedants... I think the above is really only true for records which are multiples of 32 bits, perhaps XOR wraps? I tested with VMS$PASSWORD_DICTIONARY. Checksums were the same for the indexed file, the same file converted to sequential, and the same file sorted descending).
A crucible of informative mistakes
Steven Schweda
Honored Contributor

Re: Compare files with checksum and differences

> [...] there's no performance benefit to
> using CHECKSUM [...]

There is if the files are in different
places, with a slow network connection
between them. Using CHECKSUM allows one to
send the (small) checksum over the (slow)
network instead of the whole (large) file.

Even if all the files are local, if one
wishes to make multiple comparisons against
one file, then it can be faster to get a
checksum for that file than to read it again
for every comparison.

As usual, everything's complicated.

And, in this case, CHECKSUM may buy you
nothing but confusion (especially if you
don't understand how a checksum works).
RBrown_1
Trusted Contributor

Re: Compare files with checksum and differences

My checksum test consisted of two simple text files.

File1:
1
2

File2:
2
1

Both have the same checksum.
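That two-file test can be modeled in Python (a sketch of a record-wise XOR checksum, not the exact CHECKSUM implementation; record padding is my assumption):

```python
def record_xor(records):
    # XOR each record (padded to a longword multiple) into a 32-bit value.
    acc = 0
    for rec in records:
        padded = rec + b"\0" * (-len(rec) % 4)
        for i in range(0, len(padded), 4):
            acc ^= int.from_bytes(padded[i:i + 4], "little")
    return acc

file1 = [b"1", b"2"]   # File1's records
file2 = [b"2", b"1"]   # File2's records, reordered
print(record_xor(file1) == record_xor(file2))  # → True: same checksum
```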
John McL
Trusted Contributor

Re: Compare files with checksum and differences

Did everyone notice that the DIFFERENCES output that Tina listed at 8:20pm varied only in a few bytes (4 ?) near the end of the record?

I'm not sure how that played out in calculating the CHECKSUM, but I did notice that there were 28 different records, suggesting that for the bytes in question we had 14 pairs of XORs. It looks like they were over the same values each time, which I think would mean the pairs of XORs would cancel each other out and therefore produce the same result.

CHECKSUM is a crude tool. DIFFERENCES is better, and in this case I'd use it without producing any output, with a check of $SEVERITY afterwards.

I find the CMS DIFFERENCES command to be even better and it can be used on normal files, rather than CMS generations, if you have the appropriate license.
Jon Pinkley
Honored Contributor

Re: Compare files with checksum and differences

JG>I'd suggest you write a simple program which avoids all the clever resynch
JG>logic of DIFF, just read the two files byte by byte until you reach EOF
JG>or find a difference. If all you want to know is "same/different" you can
JG>stop reading on the first difference.

While not as streamlined as a single-purpose program, you can tell DIFFERENCES to short-circuit at the first difference.

$ DIFFERENCES/MAX_DIFFERENCES=1/OUT=NL: file1 file2

does essentially what John describes, but on a record level, not a byte level. Just check $severity, if $severity .eqs. "1", then the files' record contents were the same, otherwise different.

JG>I'd also do a preliminary check on file size, obviously files that
JG>are different sizes are different!

This is a good optimization if the files are being compared on a byte-by-byte basis. It would also work for files with sequential organization that have the same record format; otherwise you can get false positives (the records are the same, but the sizes of the files are different; in other words, DIFFERENCES/MAX=1/OUT=NL: would return $SEVERITY = "1" but the sizes would differ).

Jon
it depends