
sort file containing null characters

Victor Feng
Occasional Contributor

Our application team has a file which contains some null characters. After the file is sorted, the size of output file is less than the original size.

# sort -k1,1 -o output.data input.data
# ll
-rwxr-xr-x 1 vfeng techserv 18281639 Sep 2 08:32 input.data
-rw-r--r-- 1 vfeng techserv 16736272 Sep 2 08:42 output.data

I tried this on both our HP-UX 11i v1 and v2 systems.

I also tried this on Solaris, where the sort works correctly.

For now, my workaround is to remove the null characters with sed. A couple of years ago, somebody reported the same issue on AIX. Is this a known bug for HP-UX too?

Victor
7 REPLIES
Hakki Aydin Ucar
Honored Contributor

Re: sort file containing null characters

Actually, it seems to be working. But if you want to get rid of the null characters, you can do that with sed or Perl (which in my opinion is better than sed).
There is no need to run sort before applying the workaround.

Hakki Aydin Ucar
Honored Contributor

Re: sort file containing null characters

Did you try it?

# sed '/^$/d' input.data > output.data
James R. Ferguson
Acclaimed Contributor

Re: sort file containing null characters

Hi Victor:

> Our application team has a file which contains some null characters. After the file is sorted, the size of output file is less than the original size.

A snippet of the first few lines of the input and the output files might be informative. Use something like 'xd file' so we can see things.

> For now, my workaround is to remove the null characters with sed.

Then what are you trying to do? You said that "...after the file is sorted, the size of the output file is less than the original..." Eliminating nulls before the sort would also reduce the file's size.

By the way, constructing a small file with embedded nulls and sorting it doesn't lead to any size change for me (as I would expect).

# cat -etv /tmp/sortme
ab1^@^@^@def 111$
ab2^@^@^@def 222$
ab3^@^@^@def 333$

For example, using a reverse sort for emphasis:

# sort -rk1,1 /tmp/sortme|cat -etv
ab3^@^@^@def 333$
ab2^@^@^@def 222$
ab1^@^@^@def 111$

Regards!

...JRF...

Dennis Handly
Acclaimed Contributor

Re: sort file containing null characters

>file which contains some null characters.

This is not a text file. sort(1) has a WARNING:
For non-text input files, the behaviour is undefined.

>JRF: Eliminating nulls before the sort would also reduce the file's size.

Undefined could mean that any chars in the record after the NUL could be lost.
But your example doesn't show that.
Victor Feng
Occasional Contributor

Re: sort file containing null characters

Well, the null characters in this file are different.

Here is how I noticed the nulls. When I open the file with the vi editor, I see the following message:
"vopx-extract-rn-am.data" 8930 lines, 18277789 characters (3850 nulls)

18277789 + 3850 = 18281639
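
For anyone who wants to verify the same arithmetic without opening the file in vi, here is a rough sketch. It assumes a POSIX-style tr and wc; the sample data and the variable names (sample, total, nul_count, stripped) are made up for illustration and are not from the real file.

```shell
# Build a small sample file with embedded NUL bytes (hypothetical data).
sample=$(mktemp)
printf 'ab1\000\000\000def 111\nab2\000def 222\n' > "$sample"

# Total size of the file in bytes.
total=$(wc -c < "$sample" | tr -d ' ')

# Number of NUL bytes: delete everything that is NOT a NUL, count what is left.
nul_count=$(tr -cd '\000' < "$sample" | wc -c | tr -d ' ')

# Size of the file once the NULs are stripped out.
stripped=$(tr -d '\000' < "$sample" | wc -c | tr -d ' ')

echo "total=$total nuls=$nul_count stripped=$stripped"

# The numbers should satisfy: stripped + nuls = total
if [ $((stripped + nul_count)) -eq "$total" ]; then
  echo "sizes are consistent"
fi

rm -f "$sample"
```

On the real file, the same three commands should reproduce the 18281639 / 3850 / 18277789 figures if vi's count is accurate.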

I can just type w! to save the file, and the nulls will be removed.

-rwx------ 1 vfeng techserv 18277789 Sep 2 09:34 in.txt

Or I can use sed to redirect the input to an output file, and the nulls will be removed too, e.g.
sed 's///g' in.txt > out.txt
sed 's/SOMETHING-NOT-IN-THE-FILE//g' in.txt > out.txt
sed '/^$/d' in.txt > out.txt

#ll
-rwx------ 1 vfeng techserv 18281639 Sep 2 09:34 in.txt
-rw-r----- 1 vfeng techserv 18277789 Sep 3 14:57 out.txt

Then sort will work well on out.txt.

Here are a few lines from the file:
AZ010 90001AMEND - POLICY CHANGE 999N KAT
AZ010 90002AMEND - POLICY CHANGE 999N KAT


Victor
James R. Ferguson
Acclaimed Contributor

Re: sort file containing null characters

Hi (again) Victor:

I too can observe that 'vi' and the 'sed' substitution as you used it will eliminate the nulls. In my hands, either on an 11.11 or an 11.31 machine, the 'sort' *fails* to cause the loss of characters.

While I can accept 'vi' eliminating the null characters (because it warns you that they are present), I do not agree with 'sed's behavior when one does:

# sed -e '/^$/d'

This should eliminate lines consisting only of a newline --- i.e. an "empty" line, in my opinion. I observe the same behavior you do.

> Here are a few lines from the file

This isn't helpful. If you used 'cat -etv' or 'xd' to list the file(s) we could see where null characters occur. This is why I used it in my examples.
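
If 'xd' is not available on the machine at hand, 'od -c' should give an equivalent view. A minimal sketch, assuming a POSIX od; the sample line and the variable names are only illustrative:

```shell
# Create a one-line sample with three embedded NUL bytes (hypothetical data).
sample=$(mktemp)
printf 'ab1\000\000\000def 111\n' > "$sample"

# od -c prints one character per column and shows each NUL byte as \0,
# so the position of every null in the record is visible.
dump=$(od -c "$sample")
printf '%s\n' "$dump"

rm -f "$sample"
```

Piping the first few records of the real file through the same command (for example with 'head') would show whether the nulls sit inside the sort key or elsewhere in the line.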

Regards!

...JRF...

Dennis Handly
Acclaimed Contributor

Re: sort file containing null characters

>After the file is sorted, the size of output file is less than the original size.

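From your numbers, it seems it is a lot less: the output shrank by about 1.5 MB (18281639 - 16736272 = 1545367 bytes), whereas vi reports only 3850 nulls (about 3.8 K).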

>-rwxr-xr-x 18281639 Sep 2 08:32 input.data

(It isn't a good idea to have data files be executable.)

>my workaround is to remove the null characters with sed.

Can you compare the sorted files you get by using sort directly and then sort on the file where you removed the NULs? Also use wc(1) on each.

That might indicate whether records are missing, or just parts of lines.
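
Something along these lines would do the comparison. This is only a sketch under assumptions: the file names, the sample records, and the use of 'tr -d' in place of sed to strip the NULs are all illustrative, and the result on a sort that mishandles NULs may well differ from what a well-behaved sort gives.

```shell
# Work in a scratch directory with a small sample file (hypothetical data).
work=$(mktemp -d)
printf 'b2\000x 2\na1\000y 1\nc3 3\n' > "$work/in.data"

# Sort the raw file directly, and sort a copy with the NULs stripped first.
sort -k1,1 -o "$work/raw.sorted" "$work/in.data"
tr -d '\000' < "$work/in.data" | sort -k1,1 -o "$work/clean.sorted"

# Compare line counts: a lower count in raw.sorted means whole records were lost.
raw_lines=$(wc -l < "$work/raw.sorted" | tr -d ' ')
clean_lines=$(wc -l < "$work/clean.sorted" | tr -d ' ')
echo "raw: $raw_lines lines, clean: $clean_lines lines"

# Strip the NULs from the raw sorted output and compare byte for byte.
# Equal line counts but different contents would point to truncated lines.
if tr -d '\000' < "$work/raw.sorted" | cmp -s - "$work/clean.sorted"; then
  verdict="identical apart from NULs"
else
  verdict="contents differ"
fi
echo "$verdict"

rm -rf "$work"
```

If the line counts match but the contents differ, replacing 'cmp -s' with 'diff' on the two files would show which records were cut short.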