Operating System - OpenVMS

Re: Compare files with checksum and differences

 
Shriniketan Bhagwat
Trusted Contributor

Re: Compare files with checksum and differences

Hi Tina,

MD5 is a more secure digest (hash) algorithm than the default CHECKSUM XOR. I would use the MD5 algorithm instead of the default XOR.

http://www.faqs.org/rfcs/rfc1321.html
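As a rough illustration of why a real digest is preferable, here is a sketch in Python (hashlib is the standard-library module; the byte strings are invented example data, not anything from the customer's files). A one-character change in the input produces a completely different 128-bit MD5 digest:

```python
import hashlib

# Invented example data; any byte strings work.
original = b"trade record 0001\n"
tampered = b"trade record 0002\n"

d1 = hashlib.md5(original).hexdigest()
d2 = hashlib.md5(tampered).hexdigest()

print(d1)        # 32 hex digits (a 128-bit digest)
print(d1 != d2)  # a one-character change yields a different digest
```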


Regards,
Ketan
Hoff
Honored Contributor

Re: Compare files with checksum and differences

Ok, I'm addressing a few different issues here: the differences themselves (for which there isn't enough detail for a determination yet), whether this is an integrity check, and whether there are any requirements for cryptographic security.

XOR can easily report duplicates, and it's trivial to create cases where it will return dups; it's also trivial to adjust the XOR value to match whatever you want.

MD5 is better than XOR at catching "innocent" and accidental changes. It's a decent integrity check.

The MD5 digest is not cryptographically secure, having been broken some years ago. This is security as differentiated from integrity.

http://www.kb.cert.org/vuls/id/836068

As for (better, more secure) alternatives, OpenVMS has an OpenSSL port installed as part of the CDSA component, and that offers the SHA1 digest.

OpenSSL is expressly intended to create and later verify cryptographic signatures, and to perform related security tasks.

(The VMS OpenSSL documentation is a tad cryptic, no pun intended. But I digress.)

The command to create a digest (on Unix) is:

openssl dgst -sha1 {file}

There's probably also a sha1sum port around for VMS, and that and OpenSSL are intended to produce identical results.
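The same digest can be confirmed from Python's standard library; hashlib.sha1 over the same bytes matches the hex output of `openssl dgst -sha1` and sha1sum. The "abc" input below is the well-known FIPS 180 test vector:

```python
import hashlib

# SHA-1 of the same bytes matches `openssl dgst -sha1` and sha1sum output.
digest = hashlib.sha1(b"abc").hexdigest()
print(digest)  # → a9993e364706816aba3e25717850c26c9cd0d89d (FIPS 180 test vector)
```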

Getting back to the underlying question from the stock exchange, the checksum and MD5 stuff being discussed here is akin to the "check engine" light on the dashboard of a car. It tells the driver that there's something out of spec or wrong, but not what.

As for the details of this case and of the files involved, a hexadecimal file dump and a careful look at the byte-level differences will be warranted. This could well be a file transfer error, for instance. But without seeing what the file construction and the data and the history of the data (eg: network ftp transfers, etc) might be, the immediate trigger for the differences isn't clear.

That there's a good reason here for the digests to be different is abundantly clear.
Steven Schweda
Honored Contributor

Re: Compare files with checksum and differences

> The customer intend to know if that two
> files are same or different. They used
> checksum, it said the files are same.

Read carefully:

> [...] (If the checksums differ, then
> the files differ. If the checksums match,
> then the files are unlikely to differ, where
> the meaning of "unlikely" depends on the data
> in the files, and on the checksum algorithm
> used).

Is some part of that unclear?

Matching checksums is NOT proof that the
files are the same.

> How many different 32-bit checksums are
> possible? How many different 33-bit files
> are possible?

Think about it.
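The pigeonhole arithmetic behind that question, spelled out:

```python
# A 32-bit checksum can take 2**32 distinct values, but there are
# 2**33 distinct 33-bit files -- twice as many files as checksums.
# By the pigeonhole principle, collisions are unavoidable.
checksums = 2 ** 32
files = 2 ** 33
print(files // checksums)  # → 2: on average, two 33-bit files per checksum value
```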
Hein van den Heuvel
Honored Contributor

Re: Compare files with checksum and differences

Could you tell me:

1.what kind of algorithm we should be used for RMS files with recovery unit journaling?

DIFFERENCE ... if the file is used with record access and not raw bytes.

But what is really needed, and why do you mention that they are marked for RU? Why do you think that is relevant (it may be).
Is the need for a checksum or for a difference indicator?

This is a highly confusing question to me.

Is the intent to compare data files which happen to be marked for RU journaling and have a name to reflect that, or are these the actual RMS RUJ files used to store RU data? RMS RUJ files are transient and non-transportable, and I see little point in looking at them at all (except maybe in a QA setting).

What is in the files themselves, and how is the data being used?

If it is used in RECORD mode, as DIFF uses, then the tool should probably use RECORD access and NOT look at all the bits in the blocks (CONVERT, area pre-allocation in indexed files, padding bytes for odd-sized records in sequential files).


>> 2. What is the main different with the command "differences" and "checksum"?

DIFFERENCES needs the original and will do a byte-by-byte comparison within the normal data record bytes, ignoring metadata (if any).
It is the most reliable comparison, and your weapon of choice ... IF you have the original handy.

If you need to compare all bytes, then use BACKUP/COMPARE, or temporarily change the two files' attributes to SEQ/FIX/512.

Checksum just creates a lucky number and depending on the data contents that may be an unlucky number.

>> 3. Dose the "differences" just compare with the text file?

Binary or text.

>> 4. When we intend to compare two files what is the rule we should to follow to use to compare files?

The main rule is to actually know what you are trying to accomplish.
Verify a file transfer? (Copy, NFS, FTP, backup)
Verify a processing run output?
Verify a transform and back, such as ZIP + UNZIP or BACKUP into a SAVESET and BACK?

From the questions you posted so far, I'd recommend DIFF and if you needed just a single lucky number then make it be CHECKSUM/ALGO=CRC.
The latter may give false differences, much like BACKUP/COMPARE, but that's probably better than the false equal that you appear to have suffered with CHECKSUM/FILE.
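To illustrate why CRC is the better "lucky number" here, a Python sketch using the standard-library zlib.crc32 (the file contents are invented; this models the behavior, not the exact DCL CHECKSUM/ALGORITHM=CRC implementation):

```python
import zlib

# Unlike XOR, CRC-32 is sensitive to byte position, so reordered data
# (almost always) yields a different value.  Example bytes are made up.
a = zlib.crc32(b"1\n2\n") & 0xFFFFFFFF
b = zlib.crc32(b"2\n1\n") & 0xFFFFFFFF
print(a != b)  # → True: same bytes, different order, different CRC
```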

The data must have had some XOR symmetry in there to make the XOR checksum generate the same value every time ... or an operational error slipped by.

Hope this helps some,
Hein van den Heuvel
HvdH Performance Consulting.
John Gillings
Honored Contributor

Re: Compare files with checksum and differences

Tina,

Consider, in simple terms, the XOR checksum sees the file as a stream of longwords. You just XOR them together, getting a 32 bit result. A lot of the time, any simple change in the file will result in a different checksum, but as Steven has noted a few times, the mathematics mean that there must be a very large number of possible files which generate the same checksum.

In particular, the XOR takes no account of the position of a record in a file. Therefore, take the XOR checksum of any file of any size with records in random order. Now reorder the file any way you like. As long as you don't add, remove or change any records, the checksum will be the same. Same data, different sequence, same checksum. I suspect this may be the case with your files (Try DIFF/PARALLEL - it may make things clearer)
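That position-blindness can be shown in a few lines of Python (a sketch of an XOR longword checksum, not the exact VMS CHECKSUM code; the longword values are invented):

```python
def xor_checksum(longwords):
    # XOR a stream of 32-bit longwords into one 32-bit result.
    acc = 0
    for w in longwords:
        acc ^= w
    return acc

records = [0x11111111, 0x22222222, 0x33333333]
shuffled = [0x33333333, 0x11111111, 0x22222222]  # same longwords, reordered
print(xor_checksum(records) == xor_checksum(shuffled))  # → True: XOR is order-blind
```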

Other checksum algorithms may calculate different values for reordered files, but the mathematics says there MUST be multiple files which generate the same checksum.

Since DIFF and CHECKSUM (of any algorithm) both read the whole contents of both files, there's no performance benefit to using CHECKSUM, and, as you've noticed, you get false negatives. If you want to know that two files are identical, use DIFF. Indeed, I'd suggest you write a simple program which avoids all the clever resynch logic of DIFF: just read the two files byte by byte until you reach EOF or find a difference. If all you want to know is "same/different", you can stop reading at the first difference. I'd also do a preliminary check on file size; obviously, files that are different sizes are different!

(for the pedants... I think the above is really only true for records which are multiples of 32 bits, perhaps XOR wraps? I tested with VMS$PASSWORD_DICTIONARY. Checksums were the same for the indexed file, the same file converted to sequential, and the same file sorted descending).
A crucible of informative mistakes
Steven Schweda
Honored Contributor

Re: Compare files with checksum and differences

> [...] there's no performance benefit to
> using CHECKSUM [...]

There is if the files are in different
places, with a slow network connection
between them. Using CHECKSUM allows one to
send the (small) checksum over the (slow)
network instead of the whole (large) file.

Even if all the files are local, if one
wishes to make multiple comparisons against
one file, then it can be faster to get a
checksum for that file than to read it again
for every comparison.

As usual, everything's complicated.

And, in this case, CHECKSUM may buy you
nothing but confusion (especially if you
don't understand how a checksum works).
RBrown_1
Trusted Contributor

Re: Compare files with checksum and differences

My checksum test consisted of two simple text files.

File1:
1
2

File2:
2
1

Both have the same checksum.
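That two-file test can be modeled in Python (a sketch of a record-wise XOR checksum, not the exact CHECKSUM implementation; record padding is my assumption):

```python
def record_xor(records):
    # XOR each record (padded to a longword multiple) into a 32-bit value.
    acc = 0
    for rec in records:
        padded = rec + b"\0" * (-len(rec) % 4)
        for i in range(0, len(padded), 4):
            acc ^= int.from_bytes(padded[i:i + 4], "little")
    return acc

file1 = [b"1", b"2"]   # File1's records
file2 = [b"2", b"1"]   # File2's records, reordered
print(record_xor(file1) == record_xor(file2))  # → True: same checksum
```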
John McL
Trusted Contributor

Re: Compare files with checksum and differences

Did everyone notice that the DIFFERENCES output that Tina listed at 8:20pm varied only in a few bytes (4 ?) near the end of the record?

I'm not sure how that played out in calculating the CHECKSUM, but I did notice that there were 28 different records, suggesting that for the bytes in question we had 14 pairs of XORs. It looks like they were over the same values each time, which I think would mean the pairs of XORs would cancel each other out and therefore produce the same result.

CHECKSUM is a crude tool. DIFFERENCES is better, and in this case I'd use it without producing any output, with a check of $SEVERITY afterwards.

I find the CMS DIFFERENCES command to be even better and it can be used on normal files, rather than CMS generations, if you have the appropriate license.
Jon Pinkley
Honored Contributor

Re: Compare files with checksum and differences

JG>I'd suggest you write a simple program which avoids all the clever resynch
JG>logic of DIFF, just read the two files byte by byte until you reach EOF
JG>or find a difference. If all you want to know is "same/different" you can
JG>stop reading on the first difference.

While not as streamlined as a single-purpose program, you can tell DIFFERENCES to short-circuit at the first difference.

$ DIFFERENCES/MAX_DIFFERENCES=1/OUT=NL: file1 file2

does essentially what John describes, but on a record level, not a byte level. Just check $severity, if $severity .eqs. "1", then the files' record contents were the same, otherwise different.

JG>I'd also do a preliminary check on file size, obviously files that
JG>are different sizes are different!

This is a good optimization if the files are being compared on a byte-by-byte basis. It would also work for files with sequential organization that have the same record format; otherwise you can get false positives (the records are the same, but the sizes of the files are different; in other words, DIFFERENCES/MAX=1/OUT=NL: would return $SEVERITY = "1" but the sizes would differ).

Jon
it depends