
Re: shell script

 
nick majuras
New Member

shell script

I'm working on a shell script. First of all, I'm a total n00b on Linux and shell scripts. This is a project I've taken on for school. Basically, the script should be a data extractor like those for Windows (software that, given a file filled with URLs, extracts the data you want: addresses, phone numbers, e-mails, etc.), but for Linux (since it will work a lot faster). I'm desperately looking for help, a point to start from or something. Thank you.
8 REPLIES
Steven E. Protter
Exalted Contributor

Re: shell script

This may have been done already

http://hpux.cs.utah.edu/

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Muthukumar_5
Honored Contributor

Re: shell script

Linux shell scripting usually relies on utilities such as awk, sed, and perl for pattern extraction.
If you can get the input lines into a general format, it is easy to use a shell script.

If you have issues, post your script along with the input and the expected output. Lots of shell script lovers are here.
Easy to suggest when don't know about the problem!
Alex Lavrov.
Honored Contributor

Re: shell script

Do you have to use files for data storage? You can use a MySQL database. It's very simple and makes the data much easier to handle. When you use plain text files, you should think about what's going to happen if several people update the records, for example ...
I don't give a damn for a man that can only spell a word one way. (M. Twain)
nick majuras
New Member

Re: shell script

All I've tried (I've looked, searched, and asked for a pretty long time for a nice solution) was grep. MySQL would make it harder. I was thinking of something like this:
file: url.txt (URLs saved line by line)
Now, the script should take each URL (this will be with 1 thread) and look for the @, for example (e-mails), extract what's before and after the @ until it finds breaks, extract the xxx@XXX.xxx and deposit it in a file, out.txt let's say. In the end I'll have the data extracted. Phones, addresses, etc. should be about the same. I think this is the simplest way possible? Maybe not the best.
Thank you
Stuart Browne
Honored Contributor

Re: shell script

It sounds as if you're expecting the file to have lines which all differ in layout, i.e. no fixed separator between known columns etc.

In which case, I'd be suggesting either awk or perl for their pattern matching strengths.

But mainly, you need to learn what a regular expression is, and at the very least the basics of how they work.

On any unix/linux box shell, type 'man regex', and you should start getting some info about how they work.
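
To make the regular-expression idea a bit more concrete, here is a minimal sketch rather than a finished solution. It assumes a grep that supports the -o option (GNU and BSD grep do; POSIX does not require it). The function name extract_emails and the file names in the usage line are made up for illustration. The pattern is deliberately crude: it only catches simple name@host.tld strings and will miss or mangle addresses that the RFCs allow.

```shell
#!/bin/sh
# extract_emails FILE...: print each distinct e-mail-like string found in the files.
# Hypothetical helper for illustration; assumes grep supports -o.
extract_emails() {
    # -E: extended regular expressions
    # -o: print only the part of the line that matched
    # -h: never prefix matches with the file name
    # The pattern reads: name characters, an @, host characters, a dot, letters.
    grep -E -o -h '[A-Za-z0-9._-]+@[A-Za-z0-9._-]+\.[A-Za-z]+' "$@" | sort -u
}
```

Something like `extract_emails page.html > out.txt` would then leave one address per line in out.txt, with duplicates removed by sort -u.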
One long-haired git at your service...
Muthukumar_5
Honored Contributor

Re: shell script

If you post your url.txt file (at least 40 lines) and the expected output from that file, we can write a script to meet this requirement.

We cannot get there by describing the requirement in general words; it needs to be laid out in technical terms.
Easy to suggest when don't know about the problem!
renarios
Trusted Contributor

Re: shell script

Hi Nick,

Please post your file (or another example) and specify exactly what you want. There should be dozens of helpful people eager to help you!

Cheerio,

Renarios
Nothing is more successfull as failure
H.Merijn Brand (procura
Honored Contributor

Re: shell script

Use perl and Regexp::Common

URLs and e-mail addresses are very hard to parse if you want to follow the RFCs. Regexp::Common doesn't even support e-mail addresses at the moment.

http://search.cpan.org/~abigail/Regexp-Common-2.120/

If you have URLs that you want to fetch:

use LWP::Simple;
my $content = get "http://www.there.com/file.html";

So, if you have a file with e-mail addresses, and you do not want to follow the RFCs very closely, but just plainly and crudely parse mail addresses:

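--8<---
#!/usr/bin/perl

use strict;
use warnings;

# Read from a fixed input file via the <> operator
@ARGV = ("file.url");
while (<>) {
    # Address wrapped in angle brackets, e.g. <user@host>
    if (m/<(\S+?\@\S+?)>/) {
        print "$1\n";
        next;
    }
    # Bare address made of word characters, dots, dashes and underscores
    if (m/\b([-\w._]+\@[-\w._]+)\b/) {
        print "$1\n";
        next;
    }
    print STDERR "no mail on this line\n";
}
-->8---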

but remember that

"Jonder Bl'├в ┬ж├Г ├Д ├В┬к├Д┬ж"

is a VALID e-mail according to the RFC

Enjoy, Have FUN! H.Merijn