topic Re: Remove Special Characters in Operating System - HP-UX

Remove Special Characters

uform — Tue, 15 Aug 2006 13:18:47 GMT

Hi,

We receive data file from various sources , which we load in our Datamart using shell scripts / SQL Loader. We requested source system to send data files WITHOUT any special characters (! @ # ~ etc..) in data. But in some instances we do receive with spl.characters.

In SQL Loader we use ~ as delimiter and this time data came in with ~ in data and messed up the load.

How do i check for special characters and remove them from data file before the data load ? Data files will be huge in size (700 - 800 MB) , someitmes touches iGB.

1) What is the best method to solve this problem.
2) how to identify all set of spl characters and removes them ?

Please help.

Note: Shell script OR perl script solution preferred.

Thanks

Re: Remove Special Characters

Sean Dale — Tue, 15 Aug 2006 13:25:42 GMT

I would suggest using perl to read the file and write to a new one. Perl could be used to remove the ~ from each line as it is writing it to the new file. You can then rename the new file, if needed. I think this will do the trick for you:

#!/usr/bin/perl

open(TEXTDUMP,"myfile.txt");
open(SANITIZED,">mynewfile.txt");

while( ) {

chomp($_);
$line = $_

$line =~ s/~//g;
print SANITIZED "$line\n";
}

close(SANITIZED);
close(TEXTDUMP);

Re: Remove Special Characters

A. Clay Stephenson — Tue, 15 Aug 2006 13:29:35 GMT

I would use tr to strip the characters. In this example I am allowing upper and lowercase alpha + numerics + ASCII Linefeeds (Octal 012) and nothing else.

tr -cd "[A-Za-z0-9\012]" < infile > outfile

Note that the 'c' tr option complements so the net effect when combined with the 'd' (delete) option is to delete any characters not explicitly listed in the set.

Re: Remove Special Characters

Steven E. Protter — Tue, 15 Aug 2006 13:32:52 GMT

Shalom,

What special characters to strip depends on what special characters cause problems.

The best method to solve the problem is the one you tried, have correct data submitted or don't process it.

If you think about it any website asking you to set a password can prevent input of characters that cause problems.

That being said, your best bet on such large files is a perl or other shell script.

You have to read the data form one file and write it to another.

You might also want to see if there is an upgrade for your sql loader.

Good Luck,

SEP

Re: Remove Special Characters

Bill Hassell — Tue, 15 Aug 2006 15:11:19 GMT

If you truly want to delete the special characters then Clay's solution is the best one. However, if these special characters define fields, you may want to change tr to substitute a space or other character for the special char. tr can delete or substitute.

Re: Remove Special Characters

uform — Tue, 15 Aug 2006 15:31:22 GMT

ok , using tr command.. how to
just AVOID all ascii characters
instead on mentioning allowable character set. because then i may need to include
. , - etc..

i also want to know how can i get list of
all spl characters list.

Re: Remove Special Characters

A. Clay Stephenson — Tue, 15 Aug 2006 15:44:23 GMT

If I understand your last question then remove the complement option and you will see what you are filtering out:

tr -d "[A-Za-z0-9\012]" < infile > outfile

For readability, you may want to not delete the linefeeds so:

tr -d "[A-Za-z0-9]" < infile > outfile

You can use the output of this command to better tune the first version I listed.

Re: Remove Special Characters

Sandman! — Tue, 15 Aug 2006 15:57:25 GMT

How about pre-processing the input data loads before using SQL*Loader to do the bulk loads. Since you already know the field delimiter, search for its occurence within all the data files and remove/replace it with something other than a tilde ~. Afterwards you can insert the tilde as a field delimiter in order to provide SQL*Loader a method to differentiate the columns.

I'm assuming that the data files come as is and you do some processing on your side in order to separate the fields within; which could be a totally wrong assumption.

Re: Remove Special Characters

A. Clay Stephenson — Tue, 15 Aug 2006 16:06:00 GMT

Generally a rigorous definition of the problem will almost by default lead to a solution. For example, when you state ASCII you seem to be implying printable ASCII but you should note that the ASCII character set includes many non-printable characters such as BEL, ESC, FF, LF, CR, HT, VT -- in addition to punctuation characters. You should really state what you want (or don't want) and then the solution becomes obvious.
If the answer to the last question is "I don't know" then tyou have more work to do. This lack of precision in your dataset definition may have well contributed to this situation with your input file -- or you may have completely defined the dataset and your instructions were ignored.

Re: Remove Special Characters

uform — Tue, 15 Aug 2006 16:28:07 GMT

Thanks .

i could find a code which was in use years back in our project. later on for some reason they stopped using it.

I'm not a perl programmer, I have no clue what this does, but from the document i guess this does the same what we are discussing here. Also i see a note in the document saying ~ symbol is not handled using this code. How and where to include ~ symbol in this code , so that it get deleted
while running this script.

Haa i see tr (atlast familar one :) )

more FileScan.pl
#!/usr/bin/perl -w

#check the arguments and report error if invalid.
$numArgs = $#ARGV + 1;
$numArgs == 2 or die("error usage: nr ");

#Open input file for reading
open INPUT_FILE, $ARGV[0] or die("Couldn't open input file: $ARGV[1] ");

#Open Output file for writing
open OUTPUT_FILE, ">$ARGV[1]" or die("Couldn't open output file: $ARGV[2] ");

# read one line at a time and write it back to the output file
# after replacing the non printable characters with spaces.
# Note: the contents of the line are held in $_ variable
while () {

tr [\176-\377] [\040-\040];
tr [\016-\037] [\040-\040];
tr [\001-\011] [\040-\040];
tr [\013-\014] [\040-\040];
print OUTPUT_FILE;
}

Re: Remove Special Characters

Sandman! — Tue, 15 Aug 2006 16:46:51 GMT

>Also i see a note in the document saying ~ symbol is not handled using this >code. How and where to include ~ symbol in this code , so that it get deleted >while running this script.

The code you pasted does handle tilde (~) characters in the line:

tr [\176-\377] [\040-\040];

Note -> octal number \176 == ~

Re: Remove Special Characters

Bill Hassell — Tue, 15 Aug 2006 16:52:52 GMT

The first place to start defining the desired characters (or undesired) is: man ascii

That man page is really useful with both octal and hex versions of all the possible characters in ASCII (note that all 256 bit patterns are defined in ASCII). So in your code snipett:

tr [\176-\377] [\040-\040];
tr [\016-\037] [\040-\040];
tr [\001-\011] [\040-\040];
tr [\013-\014] [\040-\040];

176 through 377 will change ~ and any character with the 8th bit set to a space (040). The 016-137 takes care of codes from SO (shift out) to US (unit separator). 001-011 takes care of SOH (start of header) to HT (horizontal tab) and 013-014 are VT (vertical tab) and NP (new page, aka, formfeed).

So your code does indeed replace (not delete) the unwanted characters with spaces -- a very important consideration.

Re: Remove Special Characters

A. Clay Stephenson — Tue, 15 Aug 2006 17:33:05 GMT

And since \176 is a "~", it gets translated to a space as well.

Re: Remove Special Characters

Rory R Hammond — Tue, 15 Aug 2006 20:29:12 GMT

tr [:punct:] < filename > converted

will remove the characters listed in your example.

tr [:punct:] "[ *]" < filename > converted

will substitute the characters with space keeping a fixed field format.

Re: Remove Special Characters

uform — Wed, 16 Aug 2006 09:35:30 GMT

Thanks everyone. OK I'm getting some idea about this perl script now. one more question though,

man ascii shows 3 sets of outputs
1) Octal - Character
2) Hexadecimal - Character
3) Decimal - Character

i understand
176(oct)=126(Hex)=7E(Dec) all implies ' ~ '.

But where in script its mentioned its octal comparison ? And man ascii doesnt show
oct numbers > 177 , but in script i have till 377 , so what are the remaining characters ?

Sorry if these are too basic questions abt perl scripting ...

Thanks for help.

Re: Remove Special Characters

A. Clay Stephenson — Wed, 16 Aug 2006 09:44:56 GMT

You are the one that mentioned you found a comment indicating that "~" was not handled but your Perl code definitely translates "~" \176 to "space" \040.

The characters above 177 octal (127 decimal) have the 8th bit set and are undefined by ASCII. These characters are often used to display graphics (corners, lines, arrows, smiley faces, etc.) but the actual character that is displayed is entirely dependent upon the display device (terminal or printer) and the current character set. In any event, your program simply translates any character with the 8th bit set to 1 to a space.

Re: Remove Special Characters

uform — Wed, 16 Aug 2006 10:21:07 GMT

Yes , i was the one mentioned ~ was not handled in this code (as per the document). But you people cleared
that out saying its handled.

My question was , where in perl script its mentioned its a octal comparision ? meaning if i just say

tr [\40] [\33];

how can one find what comparison is this , as 40 and 33 are avbl both in Hex and Decimal.

Re: Remove Special Characters

Bill Hassell — Wed, 16 Aug 2006 12:04:12 GMT

> tr [\40] [\33];

> how can one find what comparison is this , as 40 and 33 are avbl both in Hex and Decimal.

This is not something unique to Perl but the tr command's of variables. If a number starts with 0 (zero) then the number is interpreted as octal and only the numbers 01234567 are valid. So you use the octal table. The convention of using \040 to represent octal 40 carries over from shell handling of numeric values. The man page for tr says:

" The escape character \ can be used as in the shell to remove special
meaning from any character in a string. In addition, \ followed by 1,
2, or 3 octal digits represents the character whose ASCII code is
given by those digits."

The square brackets tell tr to process the enclosed characters as a 'class' where the class can be one or more characters or a range of characters (like [a-z] or [0-9] or a class name such as [:alpha:] or [:upper:]. The man page helps a lot with the tr command (which is the same whether run from the command line or inside a Perl script).

Re: Remove Special Characters

uform — Fri, 18 Aug 2006 13:52:23 GMT

Thanks everyone . Appreciate all your help.
one last question , i heard there is a size restriction using pearl script for this purposes.

MEaning our input feed file could be as big as 1 GB , can this pearl script take this
file as input and give me an output file without special characters ? or shud i complicate it by split the input file and reconnect them later ?

Thanks

Re: Remove Special Characters

A. Clay Stephenson — Fri, 18 Aug 2006 14:20:48 GMT

No, Perl will handle it in this case because you are not "slurping" the entire file but rather processing it one line at the time. "Slurping" (and that's a real technical term just like "magic cookies") means that you are reading the entire input file into an array and then processing it.