Remove Special Characters
08-15-2006 06:18 AM
We receive data files from various sources, which we load into our Datamart using shell scripts and SQL Loader. We asked the source systems to send data files WITHOUT any special characters (! @ # ~ etc.) in the data, but in some instances we still receive files with special characters.
In SQL Loader we use ~ as the delimiter, and this time the data came in with ~ inside the data and messed up the load.
How do I check for special characters and remove them from the data file before the load? The data files are huge (700 - 800 MB), sometimes reaching 1 GB.
1) What is the best method to solve this problem?
2) How do I identify the full set of special characters and remove them?
Please help.
Note: Shell script OR Perl script solution preferred.
Thanks
08-15-2006 06:25 AM
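#!/usr/bin/perl
# Copy myfile.txt to mynewfile.txt, removing every ~ along the way.
open(TEXTDUMP, "<", "myfile.txt") or die "Cannot open myfile.txt: $!";
open(SANITIZED, ">", "mynewfile.txt") or die "Cannot open mynewfile.txt: $!";
while (<TEXTDUMP>) {
    chomp($_);
    $line = $_;
    # Note: this strips ALL tildes on the line.
    $line =~ s/~//g;
    print SANITIZED "$line\n";
}
close(SANITIZED);
close(TEXTDUMP);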
08-15-2006 06:29 AM
tr -cd "[A-Za-z0-9\012]" < infile > outfile
Note that the 'c' tr option complements so the net effect when combined with the 'd' (delete) option is to delete any characters not explicitly listed in the set.
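As a minimal sketch of the whitelist approach above, the set can be widened to keep a few more characters. The function name keep_allowed and the extra characters (space, period, comma, hyphen, and the ~ delimiter) are assumptions for illustration only; adjust the set to match your own data definition.

```shell
#!/bin/sh
# keep_allowed: hypothetical helper. Copies stdin to stdout and deletes
# every byte that is NOT in the allowed set.
# Allowed here (an assumption): letters, digits, space, period, comma,
# the ~ field delimiter, newline (\012), and hyphen (last, so it is literal).
keep_allowed() {
    LC_ALL=C tr -cd 'A-Za-z0-9 .,~\012-'
}

# The characters ! # @ are dropped; the ~ delimiters survive.
printf 'ACME!~42#@~ok.\n' | keep_allowed
```

Because tr works as a stream filter, the same function can be used as `keep_allowed < infile > outfile`. Note that tr cannot tell a stray ~ inside a field from a real delimiter, so this only helps with the other special characters.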
08-15-2006 06:32 AM
Which special characters to strip depends on which special characters cause problems.
The best method to solve the problem is the one you tried: have correct data submitted, or don't process it.
If you think about it, any website asking you to set a password can prevent input of characters that cause problems.
That being said, your best bet on such large files is a Perl or shell script.
You have to read the data from one file and write it to another.
You might also want to see if there is an upgrade for your SQL Loader.
Good Luck,
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
08-15-2006 08:11 AM
Bill Hassell, sysadmin
08-15-2006 08:31 AM
just AVOID all ascii characters
instead of listing the allowable character set, because then I may need to include
. , - etc.
I also want to know how I can get a list of
all special characters.
08-15-2006 08:44 AM
tr -d "[A-Za-z0-9\012]" < infile > outfile
For readability, you may want to not delete the linefeeds so:
tr -d "[A-Za-z0-9]" < infile > outfile
You can use the output of this command to better tune the first version I listed.
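To make that output easier to read on a large file, the leftover characters can be counted by octal code. This is only a sketch: the function name list_specials is hypothetical, and it assumes an od that accepts the POSIX options -An, -v, and -b.

```shell
#!/bin/sh
# list_specials: hypothetical helper. Reads stdin and prints one line per
# distinct "special" byte: how often it occurs, then its octal code.
list_specials() {
    LC_ALL=C tr -d 'A-Za-z0-9\012' |   # drop the characters we accept
        od -An -v -b |                 # dump what is left as octal bytes
        tr -s ' ' '\012' |             # one octal code per line
        grep . |                       # discard empty lines
        sort | uniq -c                 # count each distinct code
}

# Reports one occurrence of octal 041 (!) and two of octal 176 (~).
printf 'ab~c!d~\n' | list_specials
```

The octal codes can be looked up in the ascii man page and then added to, or left out of, the allowed set used with tr -cd.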
08-15-2006 08:57 AM
I'm assuming that the data files come as is and you do some processing on your side in order to separate the fields within; which could be a totally wrong assumption.
08-15-2006 09:06 AM
If the answer to the last question is "I don't know" then you have more work to do. This lack of precision in your dataset definition may well have contributed to this situation with your input file -- or you may have completely defined the dataset and your instructions were ignored.
08-15-2006 09:28 AM
I found some code that was in use in our project years back. Later on, for some reason, they stopped using it.
I'm not a Perl programmer and have no clue what this does, but from the document I guess it does the same thing we are discussing here. I also see a note in the document saying the ~ symbol is not handled by this code. How and where do I include the ~ symbol in this code, so that it gets deleted
while running this script?
Ha, I see tr (at last, a familiar one :) )
more FileScan.pl
#!/usr/bin/perl -w
# Check the arguments and report an error if they are invalid.
$numArgs = $#ARGV + 1;
$numArgs == 2 or die("error usage: nr");
# Open the input file for reading.
open INPUT_FILE, $ARGV[0] or die("Couldn't open input file: $ARGV[0] ");
# Open the output file for writing.
open OUTPUT_FILE, ">$ARGV[1]" or die("Couldn't open output file: $ARGV[1] ");
# Read one line at a time and write it back to the output file
# after replacing the non-printable characters with spaces.
# Note: the contents of the line are held in the $_ variable.
while (<INPUT_FILE>) {
tr [\176-\377] [\040-\040];
tr [\016-\037] [\040-\040];
tr [\001-\011] [\040-\040];
tr [\013-\014] [\040-\040];
print OUTPUT_FILE;
}
08-15-2006 09:46 AM
The code you pasted does handle tilde (~) characters in the line:
tr [\176-\377] [\040-\040];
Note -> octal number \176 == ~
08-15-2006 09:52 AM
That man page is really useful, with both octal and hex versions of all the possible characters in ASCII (note that not all 256 bit patterns are defined in ASCII). So in your code snippet:
tr [\176-\377] [\040-\040];
tr [\016-\037] [\040-\040];
tr [\001-\011] [\040-\040];
tr [\013-\014] [\040-\040];
176 through 377 will change ~ and any character with the 8th bit set to a space (040). The 016-037 range takes care of codes from SO (shift out) to US (unit separator). 001-011 takes care of SOH (start of header) to HT (horizontal tab), and 013-014 are VT (vertical tab) and NP (new page, aka formfeed).
So your code does indeed replace (not delete) the unwanted characters with spaces -- a very important consideration.
Bill Hassell, sysadmin
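The difference between replacing and deleting can be seen with the same octal ranges in the tr command. This is a sketch under assumptions: the variable and function names are hypothetical, and it relies on the "[ *]" form to pad the second set with spaces.

```shell
#!/bin/sh
# Octal ranges from the Perl script: control characters (except newline 012
# and carriage return 015), plus ~ (176) and every byte above it.
UNWANTED='\001-\011\013\014\016-\037\176-\377'

# to_spaces: replace each unwanted byte with a space, so field widths
# and column positions are preserved.
to_spaces() { LC_ALL=C tr "$UNWANTED" '[ *]'; }

# delete_them: remove each unwanted byte, so later columns shift left.
delete_them() { LC_ALL=C tr -d "$UNWANTED"; }

printf 'AB~CD\tEF\n' | to_spaces     # AB CD EF
printf 'AB~CD\tEF\n' | delete_them   # ABCDEF
```

For fixed-width records the first form is usually the safer choice, since every field stays where the loader expects it.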
08-15-2006 01:29 PM
tr -d "[:punct:]" < filename > converted
will remove the characters listed in your example.
tr "[:punct:]" "[ *]" < filename > converted
will substitute the characters with spaces, keeping a fixed field format.
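To see exactly which characters the [:punct:] class covers, one option is to generate the printable ASCII range and keep only the punctuation. This is a sketch, with list_punct as a hypothetical name; it assumes the C locale, where the class is limited to ASCII punctuation.

```shell
#!/bin/sh
# list_punct: hypothetical helper. Prints, on one line, every ASCII
# character that tr treats as punctuation in the C locale.
list_punct() {
    # Emit the printable ASCII characters 33..126 followed by a newline,
    # then delete everything that is neither punctuation nor newline.
    awk 'BEGIN { for (i = 33; i < 127; i++) printf "%c", i; print "" }' |
        LC_ALL=C tr -cd '[:punct:]\012'
}

list_punct
```

The list includes ~, so using [:punct:] on a ~-delimited file would remove or blank the field delimiters as well.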
08-16-2006 02:35 AM
man ascii shows 3 sets of output:
1) Octal - Character
2) Hexadecimal - Character
3) Decimal - Character
I understand that
176 (oct) = 7E (hex) = 126 (dec) all mean ' ~ '.
But where in the script does it say that this is an octal comparison? Also, man ascii doesn't show
octal numbers > 177, but the script goes up to 377, so what are the remaining characters?
Sorry if these are too basic questions about Perl scripting ...
Thanks for the help.
08-16-2006 02:44 AM
The characters above 177 octal (127 decimal) have the 8th bit set and are undefined by ASCII. These characters are often used to display graphics (corners, lines, arrows, smiley faces, etc.) but the actual character that is displayed is entirely dependent upon the display device (terminal or printer) and the current character set. In any event, your program simply translates any character with the 8th bit set to 1 to a space.
08-16-2006 03:21 AM
that out saying its handled.
My question was: where in the Perl script does it say that this is an octal comparison? Meaning, if I just say
tr [\40] [\33];
how can one find what comparison this is, as 40 and 33 are available in both hex and decimal?
08-16-2006 05:04 AM
> how can one find what comparison this is, as 40 and 33 are available in both hex and decimal?
This is not something unique to Perl; it is how the tr command interprets numbers. A backslash followed by one to three digits is always read as octal, so only the digits 01234567 are valid and \40 means the same as \040 (a space). So you use the octal table. Writing it as \040, with a leading zero, follows the usual convention for marking octal values. The man page for tr says:
" The escape character \ can be used as in the shell to remove special
meaning from any character in a string. In addition, \ followed by 1,
2, or 3 octal digits represents the character whose ASCII code is
given by those digits."
The square brackets tell tr to process the enclosed characters as a 'class', where the class can be one or more characters, a range of characters (like [a-z] or [0-9]), or a class name such as [:alpha:] or [:upper:]. The man page helps a lot with the tr command. Perl's built-in tr operator is modeled on the command and accepts the same octal escapes, although in the Perl script the outer square brackets are simply the operator's delimiters.
Bill Hassell, sysadmin
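One way to check which character an octal escape stands for, without reading the table, is to let printf do the conversion. This is a sketch: show_code is a hypothetical helper name, and it relies on the POSIX printf rule that an argument starting with a quote is taken as the numeric code of the next character.

```shell
#!/bin/sh
# show_code: hypothetical helper. Prints the octal, decimal, and hex
# code of the single character given as its argument.
show_code() {
    # A leading quote makes printf use the character's numeric code.
    printf '%o %d %X\n' "'$1" "'$1" "'$1"
}

show_code '~'      # 176 126 7E
printf '\176\n'    # the octal escape \176 prints ~
```

The same escape, \176, is what both the tr command and the Perl script use for the tilde.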
08-18-2006 06:52 AM
One last question: I heard there is a size restriction using pearl script for this purposes.
Meaning, our input feed file could be as big as 1 GB. Can this pearl script take this
file as input and give me an output file without special characters? Or should I complicate it by splitting the input file and rejoining the pieces later?
Thanks
08-18-2006 07:21 AM
Your last question was that you "...heard there is a size restriction using pearl [sic] script for this purposes."
No, Perl's limits are those of the underlying operating system.
Regards!
...JRF...