
Re: sed syntax, help

 
Vittorio_3
Advisor

sed syntax, help

Just wondering whether anybody can explain the rules behind this syntax. I took it from a sample while learning sed, but it is a bit hard to follow, and I could not find many things in the man page, e.g. the # flag in 's#'.

#Replace last comma(,) in each line with 'and'
sed 's#\(.*\),\([^,]*\)#\1 and\2#'

Tx to all
Best
Dai
Matti_Kurkela
Honored Contributor
Solution

Re: sed syntax, help

You won't find 's#' in the man page. The "#" is not a flag: it is a delimiter character for the search command.

Normally this syntax is described as s/<search>/<replace>/[flags], but the delimiter character is not fixed to "/": whatever character follows the command character "s" (other than a backslash or a newline) is used as the delimiter. When the same character appears (unescaped) a second time, the <search> part ends and the <replace> part begins. When it appears a third time, it marks the end of the <replace> part. The [flags] part is optional.

This allows you to use whatever character is convenient as a delimiter: if you're matching pathnames, using "/" as a delimiter would require escaping all the non-delimiter slashes with backslashes, making the expression harder to read.
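
As a small illustration of the delimiter choice (the path and variable names below are made up for the example), both commands perform the same substitution, but the second is easier to read:

```shell
# Hypothetical path, used only for illustration.
path='/usr/local/bin/tool'

# "/" as delimiter: every literal slash must be escaped.
with_slash=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')

# "#" as delimiter: the slashes need no escaping.
with_hash=$(printf '%s\n' "$path" | sed 's#/usr/local#/opt#')

echo "$with_slash"   # /opt/bin/tool
echo "$with_hash"    # /opt/bin/tool
```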

Sed is based on regular expressions, or "regexps" for short. The regexp syntax is used with other tools too, like grep, awk, perl and many others. It does take a bit of effort to learn it: sometimes regexps are half-jokingly called "write-only language", i.e. reading a complicated regexp can be harder than actually designing and writing it.

> sed 's#\(.*\),\([^,]*\)#\1 and\2#'

Let's split this up a little. The first characters are "s#", so this is a search-and-replace expression, using # as a delimiter.

- search for '\(.*\),\([^,]*\)'
- replace with '\1 and\2'
- no options.

In the search expression, \( and \) are not part of the string to be searched. They define sub-expressions for later reference. In this case, they are referred to in the replacement expression.

So, in plain language, the search expression means:
- accept anything up to a comma, and remember that part as sub-expression 1.
- after the comma, take anything that does not include a comma, and remember that part as sub-expression 2.

Since the search expression neither begins with ^ nor ends with $, it is not "anchored" to either the beginning or the end of the line. But there is a "maximal munch rule": unless a limit is specified, a regular expression tries to match as much data as possible.

So, if there are two commas on the line, everything on the line up to the _last_ comma (not including the comma itself) will be assigned to sub-expression 1, and whatever is after the last comma to sub-expression 2.

In the replacement part, "\1" and "\2" mean "insert whatever was assigned to the corresponding sub-expression".
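
To see what each sub-expression actually holds, one option is to wrap \1 and \2 in brackets in the replacement. This is only a sketch; the sample line and variable names are invented for the example:

```shell
line='red, green, blue'   # sample input, invented for this example

# Wrap each sub-expression in brackets to see what it captured.
captured=$(printf '%s\n' "$line" | sed 's#\(.*\),\([^,]*\)#[\1] [\2]#')
echo "$captured"          # [red, green] [ blue]

# The original command: only the last comma is replaced.
result=$(printf '%s\n' "$line" | sed 's#\(.*\),\([^,]*\)#\1 and\2#')
echo "$result"            # red, green and blue
```

Note that sub-expression 2 keeps the space that followed the last comma, which is why the replacement is written as "\1 and\2" with no space before \2.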

Clear as mud?

MK
James R. Ferguson
Acclaimed Contributor

Re: sed syntax, help

Hi :

Matti's explanation is simply excellent.

What Matti describes as the "maximal munch rule" is generally spoken of as "greediness". It is worth noting that in languages like Perl, regular expressions can also be made "lazy", that is, made to match the least data possible.

A very short but important document about regular expressions can be found in the 'regexp(5)' manpage.

Regards!

...JRF...
Vittorio_3
Advisor

Re: sed syntax, help

Clear like day !!!

Tx so much all.

Best
N
Vittorio_3
Advisor

Re: sed syntax, help

Uff...
Yes, I got the idea, but it really takes some time to digest, especially p2.

p1..............p2..............p3
sed 's# \(.*\),\([^,]*\) # \1 and\2#'



Tx again
James R. Ferguson
Acclaimed Contributor

Re: sed syntax, help

Hi (again) Dai:

The notation '[^,]' says to match any character except a comma. This is called a non-matching list.
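
For instance (a minimal sketch with made-up input), replacing every character matched by '[^,]' leaves only the commas untouched:

```shell
fields='a,b,c'   # sample input, invented for this example

# Replace every character that is NOT a comma with X.
masked=$(printf '%s\n' "$fields" | sed 's/[^,]/X/g')
echo "$masked"   # X,X,X
```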

That said, however, given:

# X="line, line, line, another line"

# echo ${X}|sed 's#\(.*\),\([^,]*\)#\1 and\2#'
line, line, line and another line

...the same result is also produced by:

# echo ${X}|sed 's#\(.*\),\(.*\)#\1 and\2#'
line, line, line and another line

In either case, the regex engine bumps along *greedily*, capturing characters for the first piece until it has to give back the last comma it "gobbled". That comma becomes the second (albeit uncaptured) piece, and whatever follows it becomes the third piece of zero or more characters.

A better example of greediness and the reason for using a non-matching list is this:

# Y='There is "yin" and "yang" in things'

Now, suppose all we wanted was to print "yin". Compare these:

# echo $Y|perl -nle 'm/(".*")/ and print $1'
"yin" and "yang"

...which isn't what we wanted.

# echo $Y|perl -nle 'm/("[^"]*")/ and print $1'
"yin"

...which is the desired, matched output.

Perl works the same way as 'sed', though I used Perl for its less cluttered syntax: there is no need to escape '(' and ')' when grouping and capturing. In my example, I am asking Perl to read STDIN from a pipe and, if it can match something bounded in double quotes, to capture and print it.

The difference in the two examples underscores the greediness of the regular expression engine.
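
For comparison, here is a rough 'sed' equivalent of the two Perl one-liners. It is a sketch rather than the only way to write it: it uses -n with the p flag to print only when the substitution succeeds, and the variable names are my own.

```shell
Y='There is "yin" and "yang" in things'

# Greedy ".*" between the quotes runs on to the last quote on the line.
greedy=$(printf '%s\n' "$Y" | sed -n 's/[^"]*\(".*"\).*/\1/p')

# [^"]* cannot cross a quote, so the match stops at the first closing quote.
bounded=$(printf '%s\n' "$Y" | sed -n 's/[^"]*\("[^"]*"\).*/\1/p')

echo "$greedy"    # "yin" and "yang"
echo "$bounded"   # "yin"
```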

Regards!

...JRF...
Dennis Handly
Acclaimed Contributor

Re: sed syntax, help

>JRF: important document about regular expressions can be found in the 'regexp(5)' manpages.

Be aware that regexp(5) describes three types of regular expression and pattern matching notation:

1) Basic Regular Expressions, used by sed, vi, ex, grep

2) Extended Regular Expressions, used by awk and egrep

3) Pattern Matching Notation, used by shells and find
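
A small sketch of how the three notations differ in practice, assuming a grep that accepts -E (the modern spelling of egrep); the sample word and variable names are invented:

```shell
word='abbc'   # sample input, invented for this example

# 1) BRE: interval braces must be escaped.
bre=$(printf '%s\n' "$word" | grep -c 'ab\{2\}c')

# 2) ERE: the same interval is written with plain braces.
ere=$(printf '%s\n' "$word" | grep -E -c 'ab{2}c')

# 3) Shell pattern: * means "any string", not "repeat the previous item".
case $word in
  ab*c) glob=1 ;;
  *)    glob=0 ;;
esac

echo "$bre $ere $glob"   # 1 1 1
```

The same text matches in all three cases, but each notation spells the pattern differently, so a pattern copied from one tool to another may silently change meaning.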