Re: search for number and extract lines below

Gyankr · ‎06-04-2009

I have a Spanish subtitle file and I want to extract the timestamps from it. Each timestamp has a unique number above it, as in the attached file.

I would like to know how to extract all the timestamps to a separate file.

Ganesan R · ‎06-04-2009

Hi,

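If your file contains only this and you want to extract only the timestamps, you can simply grep for them. Anchoring the pattern to the start of the line keeps sequence numbers such as 100 from matching as well.

# egrep "^(00|01):" filename > outputfile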

Best wishes,

Ganesh.

Mel Burslan · ‎06-04-2009

It looks like the string "-->" is the common factor on these lines. So, this should work (the -e stops grep from treating the leading "-" of the pattern as an option):

grep -e "-->" myfile > timestamps

If you want to remove those timestamp lines and keep the rest of the file, then:

grep -v -e "-->" myfile > mytrimmedfile
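A stricter variant is sketched below, in case a subtitle line itself ever contains "-->". It assumes the usual "HH:MM:SS,mmm --> HH:MM:SS,mmm" layout seen in this thread and Unix line endings; the sample file and the tmpdir and ts variables are made up for illustration.

```shell
# Hypothetical sample data standing in for the attached subtitle file.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '1' '00:00:01,000 --> 00:00:02,500' 'Hola --> mundo' '' \
  '2' '00:00:03,000 --> 00:00:04,000' 'Adios' > "$tmpdir/myfile"

# One timestamp: two digits, colon, two digits, colon, two digits, comma, three digits.
ts='[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]'

# Keep only lines that consist of exactly "timestamp --> timestamp".
grep -E -e "^$ts --> $ts\$" "$tmpdir/myfile" > "$tmpdir/timestamps"

cat "$tmpdir/timestamps"
```

The dialogue line "Hola --> mundo" is left out because it does not match the full pattern.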

hope this helps

________________________________
UNIX because I majored in cryptology...

Gyankr · ‎06-04-2009

Thanks to both of you. Now it gets a bit more tricky: once I have extracted those timestamps, I want to replace them with the ones present in the attached English version.

So what I am trying to do here is synchronize my English and Spanish subtitles to the same timestamps.

Thanks again.

Mel Burslan · ‎06-04-2009

All right then, you will need to fiddle with sed a little. But before this, you need to make sure both files have the same number of subtitle frames. Otherwise this will not work.

The construct should be something like this:

last=1305
# get this number by manually editing the file
# and finding the last frame sequence

i=1
while [ $i -le $last ]
do

#source is spanish file, target is english below
#(swap the two file names to copy the other way)

#find the line with the timeframe in each file
#(the line right after the sequence number)
(( sfl=`grep -nx "$i" spanish|cut -d: -f1`+1 ))
(( efl=`grep -nx "$i" english|cut -d: -f1`+1 ))

#find the timeframe
spanish_time=`sed -n "${sfl}p" spanish`

#using sed or another tool of your choice
#overwrite spanish_time into english file
#onto the matching line
sed "${efl}c\\
${spanish_time}" english > /tmp/tempfile.tmp
#note the doubled backslash: inside double quotes the shell
#removes a single backslash-newline before sed ever sees it

mv /tmp/tempfile.tmp english
(( i=i+1 ))
done # end of while loop
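The precondition mentioned above (same number of frames in both files) can be checked before running such a loop. This is only a sketch: the sample files are invented, and the names n_spa, n_eng and status are illustrative.

```shell
# Hypothetical sample files; in practice point these at the real subtitle files.
tmpdir=$(mktemp -d)
printf '%s\n' '1' '00:00:01,000 --> 00:00:02,000' 'Hola' > "$tmpdir/spanish"
printf '%s\n' '1' '00:00:05,000 --> 00:00:06,000' 'Hello' '' \
              '2' '00:00:07,000 --> 00:00:08,000' 'Bye' > "$tmpdir/english"

# Count the timestamp lines in each file.
n_spa=$(grep -c -e '-->' "$tmpdir/spanish")
n_eng=$(grep -c -e '-->' "$tmpdir/english")

if [ "$n_spa" -eq "$n_eng" ]; then
  status="same number of frames: $n_spa"
else
  status="frame counts differ: spanish=$n_spa english=$n_eng"
fi
echo "$status"
```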

________________________________
UNIX because I majored in cryptology...

Hein van den Heuvel · ‎06-04-2009

Gyankr,

Please help us help you.

The provided examples do NOT seem to line up.
My Spanish is almost non existent, but it seems to me that

Sp 12 = 00:02:18,942 --> 00:02:21,850
declarar culpable al acusado.

Corresponds with

En 11 = 00:02:01,350 --> 00:02:04,267
find the accused guilty.

And

Sp 13 = 00:02:21,935 --> 00:02:25,791
Sea cual sea su decisión, su veredicto deberá ser unánime.

Corresponds with

En 12 = 00:02:04,353 --> 00:02:08,220
However you decide, your verdict must be unanimous.

There is NO match on the example times/sequence ever. Not even close. The match on the simple number seems skewed by 1.

So... what is it? Bad examples?
Match regardless?

Also... how much data can be expected?
For less than a million rows or so, a simple array can be built and the files read in sequence.

For more rows, or for a more performant implementation, you want an implementation which reads the files more or less in lock step, perhaps 'skipping' out-of-sequence records.
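The simple-array idea could look roughly like the awk sketch below. It assumes that matching on the sequence number is what is wanted (which, given the skew noted above, may not be the case), that frames are separated by blank lines, and that the timestamp is always the second line of a frame. The file names and sample data are invented.

```shell
# Hypothetical sample files: key 2 exists in both, key 3 only in Spanish.
tmpdir=$(mktemp -d)
printf '%s\n' '1' '00:00:05,000 --> 00:00:06,000' 'Hello' '' \
              '2' '00:00:07,000 --> 00:00:08,000' 'Bye' > "$tmpdir/eng.txt"
printf '%s\n' '2' '00:01:00,000 --> 00:01:01,000' 'Adios' '' \
              '3' '00:01:02,000 --> 00:01:03,000' 'Gracias' > "$tmpdir/span.txt"

awk '
  # Paragraph mode: each blank-line-separated frame is one record,
  # and each line of the frame is one field.
  BEGIN { RS = ""; FS = "\n"; OFS = "\n" }
  # First file (English): remember the timestamp for each sequence number.
  NR == FNR { ts[$1] = $2; next }
  # Second file (Spanish): swap in the English timestamp when the key is known.
  ($1 in ts) { $2 = ts[$1] }
  # Print the frame followed by the blank separator line.
  { print; print "" }
' "$tmpdir/eng.txt" "$tmpdir/span.txt" > "$tmpdir/span.synced.txt"

cat "$tmpdir/span.synced.txt"
```

Frames whose number has no English counterpart pass through unchanged, so nothing is lost if the two files do not line up.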

Regards,
Hein.

OldSchool · ‎06-04-2009

Worse yet, if the sequential numbers are to be believed, you have more entries in the English version than in the Spanish.

What I had tried was:

Given eng.txt and span.txt (the provided "subtitle" files), I created eng.ts and span.ts using grep to extract the "-->" lines as outlined previously.

I pasted span.ts and eng.ts thusly:

paste -d"/" span.ts eng.ts > change.ts

By adding "s/" to the beginning of each line and "/" to the end, you have a file you can use with sed to change the Spanish timestamps to the English ones:

sed -f change.ts span.txt > changed.txt

BUT you have to remove the singletons at the end of the file, as apparently there isn't a one-for-one correspondence of dialog stamps in the files.

Without some kind of correspondence, you can't rely on automated matching.
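The whole sequence, including the "s/" and "/" wrapping and the removal of singletons, might be scripted as in the sketch below. The awk step is a swapped-in way of doing the wrapping by hand; the sample data and the tmpdir variable are invented, and the approach still pairs timestamps purely by position.

```shell
# Hypothetical sample files: two Spanish frames, three English frames.
tmpdir=$(mktemp -d)
printf '%s\n' '1' '00:01:00,000 --> 00:01:01,000' 'Hola' '' \
              '2' '00:01:02,000 --> 00:01:03,000' 'Adios' > "$tmpdir/span.txt"
printf '%s\n' '1' '00:00:05,000 --> 00:00:06,000' 'Hello' '' \
              '2' '00:00:07,000 --> 00:00:08,000' 'Bye' '' \
              '3' '00:00:09,000 --> 00:00:10,000' 'Extra' > "$tmpdir/eng.txt"

# Extract the timestamp lines from each file.
grep -e '-->' "$tmpdir/span.txt" > "$tmpdir/span.ts"
grep -e '-->' "$tmpdir/eng.txt"  > "$tmpdir/eng.ts"

# Pair them up, drop pairs where one side is empty (the singletons),
# and wrap each remaining pair as a sed substitution: s/spanish/english/
paste -d/ "$tmpdir/span.ts" "$tmpdir/eng.ts" |
  awk -F/ '$1 != "" && $2 != "" { print "s/" $1 "/" $2 "/" }' > "$tmpdir/change.ts"

# Apply the substitutions to the Spanish file.
sed -f "$tmpdir/change.ts" "$tmpdir/span.txt" > "$tmpdir/changed.txt"

cat "$tmpdir/changed.txt"
```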

James R. Ferguson · ‎06-04-2009

Hi:

OK, as Hein said, the "timestamps" don't seem to match between files. However, if we use the first line of each paragraph as a key (your unique number), then we can substitute the contents of your Spanish file with your English one as below. If a key isn't represented in both files, nothing for that key will be reported.

# cat ./matchup
#!/usr/bin/perl
use strict;
use warnings;
my $file1 = shift or die "File1 expected\n";
my $file2 = shift or die "File2 expected\n";
die "Arguments must be files\n" unless -f $file1 && -f $file2;
my %frame;
{
    local $/ = '';    # paragraph mode: read one blank-line-separated block at a time
    my ( $fh, @a );
    # register every key (first line of each paragraph) found in file1
    open( $fh, '<', $file1 ) or die "Can't open '$file1': $!\n";
    while (<$fh>) {
        @a = split /\n/;
        push( @{ $frame{ $a[0] } }, () );
    }
    close $fh;
    # attach file2's lines to the keys that file1 also has
    open( $fh, '<', $file2 ) or die "Can't open '$file2': $!\n";
    while (<$fh>) {
        @a = split /\n/;
        if ( exists $frame{ $a[0] } ) {
            push( @{ $frame{ $a[0] } }, @a[ 1 .. $#a ] );
        }
    }
    close $fh;
}
# sort the keys numerically so that 2 comes before 10
for my $key ( sort { $a <=> $b } keys %frame ) {
    print join "\n", $key, @{ $frame{$key} }, "\n" if @{ $frame{$key} } > 0;
}
1;

...run as:

# ./matchup file1 file2

Regards!

...JRF...

Sen Hu · ‎06-15-2009

We will call your Spanish file Spanish.txt and your English file English.txt. The goal is to replace the n'th time stamp in the Spanish file with the n'th time stamp in the English file. Mel Burslan rightly pointed out that the n'th instance of --> indicates the n'th time stamp. OK so far.

# Read Spanish File into a str variable.
var str Spanish ; cat "Spanish.txt" > $Spanish
# Read English file into a str variable.
var str English ; cat "English.txt" > $English
# Count the instances of --> in Spanish.
var int count ; set $count = { sen "-->" $Spanish }
# Replace one by one.
var int n ; set $n=1
while ($n <= $count)
do
# Get the n'th time stamp from English.
var str timestamp ; set $timestamp = { stex -p -r ("^\n&-->&\n^"+makestr(int($n))) $English }
# Replace the n'th time stamp in Spanish with $timestamp (from English).
sal -r ("^\n&-->&\n^"+makestr(int($n))) $timestamp $Spanish
# Move on to the next time stamp.
set $n = $n + 1
done
# Write $Spanish back to file.
echo $Spanish > "Spanish.txt"

Please test before using. I have not tested it. I have inserted comments, so you know what each line of the script is doing. Script is in biterscripting ( http://www.biterscripting.com ) .
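For readers without biterscripting, the same positional idea (the n'th English time stamp replaces the n'th Spanish one) can be sketched with awk. This is not a translation of the script above, just an illustration under the same assumption that position alone decides the pairing; the sample data is invented, and the result goes to a new file (Spanish.synced.txt, a made-up name) rather than overwriting Spanish.txt.

```shell
# Hypothetical sample files with two frames each.
tmpdir=$(mktemp -d)
printf '%s\n' '1' '00:00:05,000 --> 00:00:06,000' 'Hello' '' \
              '2' '00:00:07,000 --> 00:00:08,000' 'Bye' > "$tmpdir/English.txt"
printf '%s\n' '1' '00:01:00,000 --> 00:01:01,000' 'Hola' '' \
              '2' '00:01:02,000 --> 00:01:03,000' 'Adios' > "$tmpdir/Spanish.txt"

awk '
  # First file (English): store each time stamp line by its position.
  NR == FNR { if (/-->/) eng[++n] = $0; next }
  # Second file (Spanish): replace the m-th time stamp if English has one.
  /-->/ { m++; if (m in eng) { print eng[m]; next } }
  # Everything else is copied through unchanged.
  { print }
' "$tmpdir/English.txt" "$tmpdir/Spanish.txt" > "$tmpdir/Spanish.synced.txt"

cat "$tmpdir/Spanish.synced.txt"
```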

Sen

Fredrik.eriksson · ‎06-17-2009

You can also use "cat -n".

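Something like this might work; it's untested, though.
Usage: ./script.sh filename.txt grep-pattern
#!/bin/bash
file=$1
pattern=$2
# first matching line, with its line number in front
# (note that grep also sees the numbers that cat -n adds)
tmp=$(cat -n "$file" | grep -e "$pattern" | head -n 1)
line=$(echo "$tmp" | awk '{print $1}')
total=$(wc -l < "$file")
# number of lines from the match to the end of the file
tail=$((total - line + 1))
head=10 # number of lines to show :)

tail -n "$tail" "$file" | head -n "$head"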

------

As I said, it's untested.

Best regards
Fredrik Eriksson

Gyankr · ‎11-06-2009

Thanks, will look into it
