Operating System - HP-UX

Re: search for number and extract lines below

 
SOLVED
Gyankr
Frequent Advisor

search for number and extract lines below

I have a Spanish subtitle file and I want to extract the timestamps from it. Each timestamp has a unique number above it, as in the attached file.

I would like to know how to extract all the timestamps to a separate file.
10 REPLIES 10
Ganesan R
Honored Contributor
Solution

Re: search for number and extract lines below

Hi,

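If your file contains only this kind of content and you want just the timestamps, you can simply grep for them. Anchoring the pattern keeps frame numbers such as 100 or 201 from matching as well:

# egrep "^(00|01):" filename > outputfile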
Best wishes,

Ganesh.
Mel Burslan
Honored Contributor

Re: search for number and extract lines below

It looks like the string "-->" is the common factor on these lines. So, this should work (the -e stops grep from reading a pattern that starts with "-" as an option):

grep -e "-->" myfile > timestamps

If you want to remove those timestamp lines and keep the rest of the file, then:

grep -v -e "-->" myfile > mytrimmedfile
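
If other lines in the file could also contain "-->", a stricter match may be safer. The following is only a sketch: it assumes every timestamp line has the shape shown in the examples quoted in this thread (HH:MM:SS,mmm --> HH:MM:SS,mmm), and the function name extract_timestamps is made up for illustration.

```shell
# extract_timestamps FILE
# Hypothetical helper: prints only the lines that consist of a complete
# "HH:MM:SS,mmm --> HH:MM:SS,mmm" timestamp pair.
extract_timestamps() {
    # tr drops carriage returns in case the file has DOS line endings
    tr -d '\r' < "$1" |
        grep -E -e '^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} *$'
}
```

It could be run as: extract_timestamps myfile > timestamps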

hope this helps
________________________________
UNIX because I majored in cryptology...
Gyankr
Frequent Advisor

Re: search for number and extract lines below

Thanks to both of you. Now it gets a bit more tricky: once I have extracted those timestamps, I want to replace them with the ones present in the attached English version.

So what I am trying to do here is synchronize my English and Spanish subtitles so that both use the same timestamps.

Thanks again.
Mel Burslan
Honored Contributor

Re: search for number and extract lines below

All right then, you will need to fiddle with sed a little. But before this, you need to make sure both files have the same number of subtitle frames. Otherwise this will not work.
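
One way to check the frame counts up front is to count the "-->" lines in each file. This is a rough sketch, assuming one such line per frame; the helper names count_frames and same_frame_count are invented here.

```shell
# count_frames FILE
# Hypothetical helper: prints how many lines of FILE contain "-->".
count_frames() {
    grep -c -e '-->' "$1"
}

# same_frame_count FILE1 FILE2
# Succeeds (exit status 0) only when both files hold the same number of frames.
same_frame_count() {
    [ "$(count_frames "$1")" -eq "$(count_frames "$2")" ]
}
```

For example: same_frame_count english spanish || echo "frame counts differ"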

The construct should be something like this:

last=1305
# get this number by manually editing the file
# and finding the last frame sequence

i=1
while [ $i -le $last ]
do

# source is the spanish file, target is the english file
# (swap the two names to copy the timestamps the other way)

# find the frame number line in each file
sfl=$(grep -n "^$i\$" spanish | head -1 | cut -d: -f1)
efl=$(grep -n "^$i\$" english | head -1 | cut -d: -f1)

# skip frames that are missing from either file
if [ -n "$sfl" ] && [ -n "$efl" ]
then
# the timestamp sits on the line after the frame number
stl=$((sfl + 1))
etl=$((efl + 1))

# fetch the spanish timestamp
spanish_time=$(sed -n "${stl}p" spanish)

# overwrite spanish_time onto the matching line of the english file.
# The backslash after c is doubled because a single backslash-newline
# inside double quotes is removed by the shell before sed sees it.
sed "${etl}c\\
${spanish_time}" english > /tmp/tempfile.tmp
mv /tmp/tempfile.tmp english
fi

i=$((i + 1))
done # end of while loop


________________________________
UNIX because I majored in cryptology...
Hein van den Heuvel
Honored Contributor

Re: search for number and extract lines below

Gyankr,

Please help us help you.

The provided examples do NOT seem to line up.
My Spanish is almost nonexistent, but it seems to me that


Sp 12 = 00:02:18,942 --> 00:02:21,850
declarar culpable al acusado.

Corresponds with

En 11 = 00:02:01,350 --> 00:02:04,267
find the accused guilty.

And

Sp 13 = 00:02:21,935 --> 00:02:25,791
Sea cual sea su decisión, su veredicto deberá ser unánime.

Corresponds with

En 12 = 00:02:04,353 --> 00:02:08,220
However you decide, your verdict must be unanimous.

There is NO match on the example times/sequences at all. Not even close. The match on the simple number seems skewed by 1.

So... what is it? Bad examples?
Match regardless?

Also... how much data can be expected?
For less than a million rows or so, a simple array can be built and the files read in sequence.

For more rows, or for a faster implementation, you want one that reads the files more or less in lock step, perhaps skipping out-of-sequence records.
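
A minimal sketch of the array approach, assuming each frame is a number line followed by a timestamp line, as in the examples quoted above. The function name sync_by_number and the file names are placeholders. It matches on the frame number, so it only helps if the numbers in the two files really do correspond.

```shell
# sync_by_number ENGLISH SPANISH
# Hypothetical helper: prints SPANISH with each timestamp replaced by the
# English timestamp that follows the same frame number. Frames whose number
# does not occur in ENGLISH keep their own timestamp.
sync_by_number() {
    awk '
        { sub(/\r$/, "") }               # drop a trailing CR (DOS line endings)
        FNR == 1 { num = "" }            # forget the frame number at each new file
        NR == FNR {                      # first file: collect English timestamps
            if ($0 ~ /^[0-9]+$/) {
                num = $0
            } else if (index($0, "-->") && num != "") {
                ts[num] = $0
                num = ""
            }
            next
        }
        $0 ~ /^[0-9]+$/ { num = $0 }     # second file: remember the frame number
        index($0, "-->") {               # timestamp line: swap it if we have a match
            if (num in ts) $0 = ts[num]
            num = ""
        }
        { print }
    ' "$1" "$2"
}
```

It could be run as: sync_by_number eng.txt span.txt > span.synced.txt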

Regards,
Hein.


OldSchool
Honored Contributor

Re: search for number and extract lines below

Worse yet... if the sequential numbers are to be believed, you have more entries in the English version than in the Spanish one.

What I tried was:

Given eng.txt and span.txt (the provided "subtitle" files), I created eng.ts and span.ts by using grep to extract the "-->" lines as outlined previously.

I pasted span.ts and eng.ts together like this:

paste -d"/" span.ts eng.ts > change.ts

After adding "s/" to the beginning of each line and "/" to the end, you have a file you can use with sed to change the Spanish timestamps to the English ones:

sed -f change.ts span.txt > changed.txt

BUT you have to remove the singletons at the end of the file, as apparently there isn't a one-for-one correspondence of dialog stamps between the files.

Without some kind of correspondence, you can't rely on automated matching.
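
The paste and sed steps can be strung together roughly as follows. This is a sketch only: build_sed_script is an invented name, and it assumes span.ts and eng.ts have Unix line endings and contain nothing but timestamp lines. Because every substitution is tried on every line, it also assumes no English timestamp is identical to a later Spanish one.

```shell
# build_sed_script SPAN_TS ENG_TS
# Hypothetical helper: pairs the n-th Spanish timestamp with the n-th English
# one and prints one "s/old/new/" command per complete pair. Lines where one
# side is empty (the singletons) are dropped by the grep.
build_sed_script() {
    paste -d'/' "$1" "$2" |
        grep -e '^[^/][^/]*/[^/][^/]*$' |
        sed -e 's|^|s/|' -e 's|$|/|'
}
```

For example: build_sed_script span.ts eng.ts > change.ts, followed by sed -f change.ts span.txt > changed.txt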
James R. Ferguson
Acclaimed Contributor

Re: search for number and extract lines below

Hi:

OK, as Hein said, the "timestamps" don't seem to match between files. However, if we use the first line of each paragraph as a key (your unique number), then we can substitute the contents of your Spanish file with your English one as below. If a key isn't represented in both files, nothing for that key will be reported.

# cat ./matchup
#!/usr/bin/perl
use strict;
use warnings;
my $file1 = shift or die "File1 expected\n";
my $file2 = shift or die "File2 expected\n";
die "Arguments must be files\n" unless -f $file1 && -f $file2;
my %frame;
{
    local $/ = '';    # paragraph mode: each read returns one blank-line-separated frame
    my ( $fh, @a );
    open( $fh, '<', $file1 ) or die "Can't open '$file1': $!\n";
    while (<$fh>) {
        @a = split /\n/;
        push( @{ $frame{ $a[0] } }, () );    # register the key; nothing is stored yet
    }
    close $fh;
    open( $fh, '<', $file2 ) or die "Can't open '$file2': $!\n";
    while (<$fh>) {
        @a = split /\n/;
        if ( exists $frame{ $a[0] } ) {
            push( @{ $frame{ $a[0] } }, @a[ 1 .. $#a ] );
        }
    }
    close $fh;
}
# sort numerically so that frame 10 follows frame 9 rather than frame 1
for my $key ( sort { $a <=> $b } keys %frame ) {
    print join "\n", $key, @{ $frame{$key} }, "\n" if @{ $frame{$key} } > 0;
}
1;

...run as:

# ./matchup file1 file2

Regards!

...JRF...





Sen Hu
New Member

Re: search for number and extract lines below


We will call your Spanish file Spanish.txt and your English file English.txt. The goal is to replace the n'th time stamp in the Spanish file with the n'th time stamp in the English file. Mel Burslan rightly pointed out that the n'th instance of --> indicates the n'th time stamp. OK so far.





# Read Spanish File into a str variable.
var str Spanish ; cat "Spanish.txt" > $Spanish
# Read English file into a str variable.
var str English ; cat "English.txt" > $English
# Count the instances of --> in Spanish.
var int count ; set $count = { sen "-->" $Spanish }
# Replace one by one.
var int n ; set $n=1
while ($n <= $count)
do
# Get the n'th time stamp from English.
var str timestamp ; set $timestamp = { stex -p -r ("^\n&-->&\n^"+makestr(int($n))) $English }
# Replace the n'th time stamp in Spanish with $timestamp (from English).
sal -r ("^\n&-->&\n^"+makestr(int($n))) $timestamp $Spanish
# Move on to the next time stamp; without this the loop never ends.
set $n = $n + 1
done
# Write $Spanish back to file.
echo $Spanish > "Spanish.txt"





Please test before using. I have not tested it. I have inserted comments, so you know what each line of the script is doing. Script is in biterscripting ( http://www.biterscripting.com ) .
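
For comparison, the same n'th-for-n'th replacement can be sketched with standard shell tools. This is an illustration under assumptions, not a tested replacement for the script above: it treats any line containing "-->" as a time stamp, and the function name sync_by_position is made up.

```shell
# sync_by_position ENGLISH SPANISH
# Hypothetical helper: prints SPANISH with its n-th "-->" line replaced by the
# n-th "-->" line of ENGLISH. Spanish time stamps beyond the last English one
# are left unchanged.
sync_by_position() {
    awk '
        NR == FNR {                       # first file: store English time stamps
            if (index($0, "-->")) eng[++n] = $0
            next
        }
        index($0, "-->") {                # second file: replace by position
            k++
            if (k in eng) $0 = eng[k]
        }
        { print }
    ' "$1" "$2"
}
```

It could be run as: sync_by_position English.txt Spanish.txt > Spanish.new.txt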

Sen
Fredrik.eriksson
Valued Contributor

Re: search for number and extract lines below

You can also use "cat -n".


Something like this might work; it's untested though.
Usage: ./script.sh filename.txt grep-pattern
#!/bin/bash
file=$1
pattern=$2
# number every line, then keep the number of the first line that matches
line=$(cat -n "$file" | grep -e "$pattern" | head -n 1 | awk '{print $1}')
[ -n "$line" ] || exit 1 # pattern not found
head=10 # number of lines to show :)

# print $head lines, starting at the matching line
tail -n +"$line" "$file" | head -n "$head"

------

As I said, it's untested.

Best regards
Fredrik Eriksson