Operating System - HP-UX
1820626 Members
1899 Online
109626 Solutions
New Discussion юеВ

Re: Remove Special Characters

 
SOLVED
Go to solution
uform
Frequent Advisor

Remove Special Characters

Hi,

We receive data file from various sources , which we load in our Datamart using shell scripts / SQL Loader. We requested source system to send data files WITHOUT any special characters (! @ # ~ etc..) in data. But in some instances we do receive with spl.characters.

In SQL Loader we use ~ as delimiter and this time data came in with ~ in data and messed up the load.

How do i check for special characters and remove them from data file before the data load ? Data files will be huge in size (700 - 800 MB) , someitmes touches iGB.

1) What is the best method to solve this problem.
2) how to identify all set of spl characters and removes them ?

Please help.

Note: Shell script OR perl script solution preferred.

Thanks
22 REPLIES 22
Sean Dale
Trusted Contributor

Re: Remove Special Characters

I would suggest using perl to read the file and write to a new one. Perl could be used to remove the ~ from each line as it is writing it to the new file. You can then rename the new file, if needed. I think this will do the trick for you:

#!/usr/bin/perl

open(TEXTDUMP,"myfile.txt");
open(SANITIZED,">mynewfile.txt");

while( ) {

chomp($_);
$line = $_

$line =~ s/~//g;
print SANITIZED "$line\n";
}

close(SANITIZED);
close(TEXTDUMP);
Live life everyday
A. Clay Stephenson
Acclaimed Contributor

Re: Remove Special Characters

I would use tr to strip the characters. In this example I am allowing upper and lowercase alpha + numerics + ASCII Linefeeds (Octal 012) and nothing else.

tr -cd "[A-Za-z0-9\012]" < infile > outfile

Note that the 'c' tr option complements so the net effect when combined with the 'd' (delete) option is to delete any characters not explicitly listed in the set.
If it ain't broke, I can fix that.
Steven E. Protter
Exalted Contributor

Re: Remove Special Characters

Shalom,

What special characters to strip depends on what special characters cause problems.

The best method to solve the problem is the one you tried, have correct data submitted or don't process it.

If you think about it any website asking you to set a password can prevent input of characters that cause problems.

That being said, your best bet on such large files is a perl or other shell script.

You have to read the data form one file and write it to another.

You might also want to see if there is an upgrade for your sql loader.

Good Luck,

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Bill Hassell
Honored Contributor

Re: Remove Special Characters

If you truly want to delete the special characters then Clay's solution is the best one. However, if these special characters define fields, you may want to change tr to substitute a space or other character for the special char. tr can delete or substitute.


Bill Hassell, sysadmin
uform
Frequent Advisor

Re: Remove Special Characters

ok , using tr command.. how to
just AVOID all ascii characters
instead on mentioning allowable character set. because then i may need to include
. , - etc..

i also want to know how can i get list of
all spl characters list.
A. Clay Stephenson
Acclaimed Contributor

Re: Remove Special Characters

If I understand your last question then remove the complement option and you will see what you are filtering out:

tr -d "[A-Za-z0-9\012]" < infile > outfile

For readability, you may want to not delete the linefeeds so:

tr -d "[A-Za-z0-9]" < infile > outfile

You can use the output of this command to better tune the first version I listed.
If it ain't broke, I can fix that.
Sandman!
Honored Contributor

Re: Remove Special Characters

How about pre-processing the input data loads before using SQL*Loader to do the bulk loads. Since you already know the field delimiter, search for its occurence within all the data files and remove/replace it with something other than a tilde ~. Afterwards you can insert the tilde as a field delimiter in order to provide SQL*Loader a method to differentiate the columns.

I'm assuming that the data files come as is and you do some processing on your side in order to separate the fields within; which could be a totally wrong assumption.
A. Clay Stephenson
Acclaimed Contributor

Re: Remove Special Characters

Generally a rigorous definition of the problem will almost by default lead to a solution. For example, when you state ASCII you seem to be implying printable ASCII but you should note that the ASCII character set includes many non-printable characters such as BEL, ESC, FF, LF, CR, HT, VT -- in addition to punctuation characters. You should really state what you want (or don't want) and then the solution becomes obvious.
If the answer to the last question is "I don't know" then tyou have more work to do. This lack of precision in your dataset definition may have well contributed to this situation with your input file -- or you may have completely defined the dataset and your instructions were ignored.
If it ain't broke, I can fix that.
uform
Frequent Advisor

Re: Remove Special Characters

Thanks .

i could find a code which was in use years back in our project. later on for some reason they stopped using it.

I'm not a perl programmer, I have no clue what this does, but from the document i guess this does the same what we are discussing here. Also i see a note in the document saying ~ symbol is not handled using this code. How and where to include ~ symbol in this code , so that it get deleted
while running this script.

Haa i see tr (atlast familar one :) )

more FileScan.pl
#!/usr/bin/perl -w

#check the arguments and report error if invalid.
$numArgs = $#ARGV + 1;
$numArgs == 2 or die("error usage: nr ");

#Open input file for reading
open INPUT_FILE, $ARGV[0] or die("Couldn't open input file: $ARGV[1] ");

#Open Output file for writing
open OUTPUT_FILE, ">$ARGV[1]" or die("Couldn't open output file: $ARGV[2] ");

# read one line at a time and write it back to the output file
# after replacing the non printable characters with spaces.
# Note: the contents of the line are held in $_ variable
while () {

tr [\176-\377] [\040-\040];
tr [\016-\037] [\040-\040];
tr [\001-\011] [\040-\040];
tr [\013-\014] [\040-\040];
print OUTPUT_FILE;
}
Sandman!
Honored Contributor

Re: Remove Special Characters

>Also i see a note in the document saying ~ symbol is not handled using this >code. How and where to include ~ symbol in this code , so that it get deleted >while running this script.

The code you pasted does handle tilde (~) characters in the line:

tr [\176-\377] [\040-\040];

Note -> octal number \176 == ~
Bill Hassell
Honored Contributor

Re: Remove Special Characters

The first place to start defining the desired characters (or undesired) is: man ascii

That man page is really useful with both octal and hex versions of all the possible characters in ASCII (note that all 256 bit patterns are defined in ASCII). So in your code snipett:

tr [\176-\377] [\040-\040];
tr [\016-\037] [\040-\040];
tr [\001-\011] [\040-\040];
tr [\013-\014] [\040-\040];

176 through 377 will change ~ and any character with the 8th bit set to a space (040). The 016-137 takes care of codes from SO (shift out) to US (unit separator). 001-011 takes care of SOH (start of header) to HT (horizontal tab) and 013-014 are VT (vertical tab) and NP (new page, aka, formfeed).

So your code does indeed replace (not delete) the unwanted characters with spaces -- a very important consideration.


Bill Hassell, sysadmin
A. Clay Stephenson
Acclaimed Contributor

Re: Remove Special Characters

And since \176 is a "~", it gets translated to a space as well.
If it ain't broke, I can fix that.
Rory R Hammond
Trusted Contributor

Re: Remove Special Characters


tr [:punct:] < filename > converted

will remove the characters listed in your example.

tr [:punct:] "[ *]" < filename > converted

will substitute the characters with space keeping a fixed field format.
There are a 100 ways to do things and 97 of them are right
uform
Frequent Advisor

Re: Remove Special Characters

Thanks everyone. OK I'm getting some idea about this perl script now. one more question though,

man ascii shows 3 sets of outputs
1) Octal - Character
2) Hexadecimal - Character
3) Decimal - Character

i understand
176(oct)=126(Hex)=7E(Dec) all implies ' ~ '.

But where in script its mentioned its octal comparison ? And man ascii doesnt show
oct numbers > 177 , but in script i have till 377 , so what are the remaining characters ?

Sorry if these are too basic questions abt perl scripting ...

Thanks for help.
A. Clay Stephenson
Acclaimed Contributor

Re: Remove Special Characters

You are the one that mentioned you found a comment indicating that "~" was not handled but your Perl code definitely translates "~" \176 to "space" \040.

The characters above 177 octal (127 decimal) have the 8th bit set and are undefined by ASCII. These characters are often used to display graphics (corners, lines, arrows, smiley faces, etc.) but the actual character that is displayed is entirely dependent upon the display device (terminal or printer) and the current character set. In any event, your program simply translates any character with the 8th bit set to 1 to a space.
If it ain't broke, I can fix that.
uform
Frequent Advisor

Re: Remove Special Characters

Yes , i was the one mentioned ~ was not handled in this code (as per the document). But you people cleared
that out saying its handled.

My question was , where in perl script its mentioned its a octal comparision ? meaning if i just say

tr [\40] [\33];

how can one find what comparison is this , as 40 and 33 are avbl both in Hex and Decimal.




Bill Hassell
Honored Contributor

Re: Remove Special Characters

> tr [\40] [\33];

> how can one find what comparison is this , as 40 and 33 are avbl both in Hex and Decimal.

This is not something unique to Perl but the tr command's of variables. If a number starts with 0 (zero) then the number is interpreted as octal and only the numbers 01234567 are valid. So you use the octal table. The convention of using \040 to represent octal 40 carries over from shell handling of numeric values. The man page for tr says:

" The escape character \ can be used as in the shell to remove special
meaning from any character in a string. In addition, \ followed by 1,
2, or 3 octal digits represents the character whose ASCII code is
given by those digits."

The square brackets tell tr to process the enclosed characters as a 'class' where the class can be one or more characters or a range of characters (like [a-z] or [0-9] or a class name such as [:alpha:] or [:upper:]. The man page helps a lot with the tr command (which is the same whether run from the command line or inside a Perl script).


Bill Hassell, sysadmin
uform
Frequent Advisor

Re: Remove Special Characters

Thanks everyone . Appreciate all your help.
one last question , i heard there is a size restriction using pearl script for this purposes.

MEaning our input feed file could be as big as 1 GB , can this pearl script take this
file as input and give me an output file without special characters ? or shud i complicate it by split the input file and reconnect them later ?

Thanks
A. Clay Stephenson
Acclaimed Contributor

Re: Remove Special Characters

No, Perl will handle it in this case because you are not "slurping" the entire file but rather processing it one line at the time. "Slurping" (and that's a real technical term just like "magic cookies") means that you are reading the entire input file into an array and then processing it.
If it ain't broke, I can fix that.
James R. Ferguson
Acclaimed Contributor

Re: Remove Special Characters

Hi:

Your last question was that you "...heard there is a size restriction using pearl [sic] script for this purposes."

No, Perl's limits are those of the underlying operating system.

Regards!

...JRF...
A. Clay Stephenson
Acclaimed Contributor
Solution

Re: Remove Special Characters

... and I should add that even if you were "slurping" it would really depend upon the underlying OS limits such as maxdsiz or ulimit (which are also adjustable) or whether the filesystem supports largefiles (for output) rather than hard-coded built-in Perl limits.
If it ain't broke, I can fix that.
uform
Frequent Advisor

Re: Remove Special Characters

Thank you all.