Zip files 1.7 Million files

SOLVED
Go to solution
ssheri
Advisor

Zip files 1.7 Million files

Hi,

I have a filesystem which has 1.7 million files, containing files dating back to Jan 2005. I need to gzip all files up to Dec 2008, on a yearly basis, i.e. one zip file for 2005, one zip file for 2006, and so on for the rest of the years.

Can anyone help me with the commands for this task?

Your help is much appreciated.
18 REPLIES
OldSchool
Honored Contributor

Re: Zip files 1.7 Million files

do the filenames have the date in them somewhere? or are you relying on the datestamps? or something else entirely?

Suraj K Sankari
Honored Contributor

Re: Zip files 1.7 Million files

HI,
First make a tar file, then compress it with the compress or gzip utility:

tar -cvf 2005.tar /directory_name
gzip 2005.tar

Suraj
Michael Steele_2
Honored Contributor

Re: Zip files 1.7 Million files

cd dir
find . -atime 360 -exec ll {} \;

Verify your selection by listing everything captured by find

When ready

find . -atime 360 -exec gzip {} \;

This is for one year. 720 for two years, etc.

I'd also suggest using 'tar' after you gzip else you'll run out of space fast. Real, real fast. In fact, having another dir to work with would be good.

find . -name '*.gz' | xargs tar -rf backup.tar
Support Fatherhood - Stop Family Law
OldSchool
Honored Contributor

Re: Zip files 1.7 Million files

"tar -cvf 2005.tar /directory_name
gzip 2005.tar"

that assumes the OP had the files already segregated into directories by year, which may or may not be the case.


"find . -atime 360 -exec gzip {} \;"

is probably closer to what the OP wants, but will result in one zip file for each original file found...which may be what they're after.

or you could take the above "find" and "mv" the file to a separate directory, then gzip each, then tar the results....or mv the file, tar the directory and gzip *that*.

ssheri needs to remember that there is no "create date" stored in unix filesystems. M. Steele is going after the "access time", which may be a good bet. See the "man" page for "find", in particular the "-atime", "-mtime" and "-ctime" options, to see which best fits.

Another option would be to create two reference files with appropriate dates, and use the "-newer" option (and "! -newer" for the upper bound) to sort out what you want.
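A minimal sketch of that reference-file idea, assuming a POSIX find and touch (REF and SRC here are throwaway placeholder directories with demo files, not anything from the thread):

```shell
# Bracket the target year with two marker files, then let find's
# mtime comparison do the selection. Sketch only, untested on HP-UX.
REF=$(mktemp -d)
touch -t 200501010000 "$REF/first.ref"    # lower bound: 2005-01-01 00:00
touch -t 200512312359.59 "$REF/last.ref"  # upper bound: 2005-12-31 23:59:59

# Throwaway demo files standing in for the real data:
SRC=$(mktemp -d)
touch -t 200506150000 "$SRC/from_2005"
touch -t 200706150000 "$SRC/from_2007"

# Standard find has no "-older"; the upper bound is "! -newer":
find "$SRC" -type f -newer "$REF/first.ref" ! -newer "$REF/last.ref"
```

Against the demo directory, only from_2005 is printed; from_2007 fails the "! -newer" test.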

All of the above is why I originally asked if the date was somehow "buried" in the filename.

some additional information about the original data layout, and the desired results might help in providing more appropriate responses.
Steven Schweda
Honored Contributor

Re: Zip files 1.7 Million files

> Another option would be to create two
> reference files [...]

This seems like a better scheme than any of
the "-time" options. Especially if
you're not running the job at 00:00 on 1
January. "-atime" would seem to be the
least likely to get the desired result
(unless no one ever looks at these files).

> or you could take the above "find" and
> "mv" the file [...]

I'd vote for moving them to year-specific
directories that way, and then doing
something like:

tar cf - year_2005_dir | \
gzip -c > year_2005_dir.tar.gz

Creating an actual "tar" archive file, and
_then_ hitting it with gzip tends to require
more disk space, at least temporarily.

> find . -atime 360 -exec gzip {} \;
>
> This is for one year. 720 for two years,
> etc.

Around here, years are longer than 360 days.
Which calendar do you use? (And which does
"find" use?)
Viktor Balogh
Honored Contributor

Re: Zip files 1.7 Million files

If I wanted to separate the files based explicitly on the year, I would go this way to create a file list:

# find . -exec ll {} + | awk '$8 == "2007"' | tee list_2007

This lists exactly the files from year 2007 (1st Jan -> 31st Dec) and also dives into subdirs. After that you could feed this file to gzip/tar or whatever you want...
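One hedged way such a list could then be fed to tar (this assumes the filename is the last whitespace-delimited field of each ll line, i.e. no spaces in names, and uses tar -r to append batch after batch so the command line never overflows; the files below are throwaway stand-ins):

```shell
# Demo with throwaway files standing in for the 2007 data:
tmp=$(mktemp -d); cd "$tmp"
echo a > f1; echo b > f2
ls -l f1 f2 > list_2007            # stand-in for the list made above

# Last field of each line is the filename; -r appends to the archive,
# so repeated xargs batches all end up in one tar file:
awk '{print $NF}' list_2007 | xargs tar -rf archive_2007.tar
gzip archive_2007.tar
```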

****
Unix operates with beer.
OldSchool
Honored Contributor

Re: Zip files 1.7 Million files

lots of options presented.....still waiting for "ssheri" to shed some light on the original directory layout and the desired output.

from what was originally stated, it could well be that the OP wants a gzip file for a given year that contains all the files for that year (as opposed to zipping a tar of those files).

If so, I don't think that option has been covered yet, and it might be a pain to implement.
ssheri
Advisor

Re: Zip files 1.7 Million files

Hi All,
Thanks for your quick responses. I hope this explains my requirement in detail.
=======================================

I have a filesystem which contains 1.7 million files. Files have been there since 2005 till today. My requirement is to tar and zip the files for each year separately, i.e. one tar/zip file each for 2005, 2006, 2007 and 2008. The files can be identified by their timestamp; there are no separate directories for each year. All files reside in a single directory.
======================================
OldSchool
Honored Contributor
Solution

Re: Zip files 1.7 Million files

"I have a filesystem which contains 1.7 million files. Files have been there since 2005 till today. My requirement is to tar and zip the files for each year separately, i.e. one tar/zip file each for 2005, 2006, 2007 and 2008. The files can be identified by their timestamp; there are no separate directories for each year. All files reside in a single directory."

Ok, this could get ugly. Assuming the files will be removed after archiving, something like the following can be adapted to work:

First, you need to realize that UNIX doesn't have / track a file timestamp related to the "creation time". It knows the following:

atime (File Access Time)
Access time shows the last time the data from a file was accessed - read by one of the Unix processes directly or through commands and scripts.

ctime (File Change Time)
ctime changes when you change a file's ownership or access permissions. It also changes whenever the file's contents are updated, so it reflects the last change of any kind to the file.

mtime (File Modify Time)
Last modification time shows the time of the last change to the file's contents. It does not change with owner or permission changes, and is therefore used for tracking actual changes to the data of the file itself.

So...which one you look at depends on what you want. IF you can guarantee that the contents of the file, once written, were never modified, then the mtime option of find should be ok. Access time is useless for this if the file has ever been read after writing. Ctime *might* work.
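For reference, ls can display each of the three timestamps directly (a harmless check on a scratch file, using standard ls options):

```shell
# Show mtime, ctime and atime for the same file:
f=$(mktemp)
ls -l  "$f"   # default listing shows mtime (content modification)
ls -lc "$f"   # -c with -l: show ctime (inode change) instead
ls -lu "$f"   # -u with -l: show atime (last access) instead
```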

If none of the above apply, then you're toast, as you've no way to locate files written in 2005.

Let us say that mtime is workable in your case, and you are going to find those files in year 2005. I'd create two reference files representing the upper and lower limits of the times you wish to locate:

touch -a -m -t 200501010000.00 $HOME/first.ref
touch -a -m -t 200512312359.59 $HOME/last.ref

should get everything between 01/01/2005 at 00:00 and 12/31/2005 at 23:59:59.


Then use find to locate the relevant files and move them to a directory by themselves:

mkdir /yourname/2005
cd /where_files_are

find . -xdev -type f -newer $HOME/first.ref -a ! -newer $HOME/last.ref -exec mv {} /yourname/2005/. \+

at that point, you should be able to tar the newly created directory and pipe that to gzip as noted in one of the posts above.

Note that the above has not been tested; you might want to substitute something harmless, like ls, for the move until you get it sorted out.

Repeat the above for each year, after adjusting the timestamps on the ref files and creating the required directories.
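Putting the whole cycle together, a hedged end-to-end sketch (SRC, DST and the two sample files are throwaway placeholders rather than the real layout; "\;" is used here so the {} position is legal, and the tar-to-gzip pipe follows Steven's suggestion above):

```shell
# End-to-end demo: per-year reference files, find+mv, then tar | gzip.
SRC=$(mktemp -d); DST=$(mktemp -d); REF=$(mktemp -d)
touch -t 200506150000 "$SRC/f2005"   # pretend file written in 2005
touch -t 200606150000 "$SRC/f2006"   # pretend file written in 2006

for year in 2005 2006; do            # extend to 2007 2008 as needed
    touch -t "${year}01010000" "$REF/first.ref"
    touch -t "${year}12312359.59" "$REF/last.ref"
    mkdir -p "$DST/$year"
    # \; runs one mv per file, which keeps {} in a legal position:
    find "$SRC" -xdev -type f \
        -newer "$REF/first.ref" ! -newer "$REF/last.ref" \
        -exec mv {} "$DST/$year/" \;
    # tar the year directory and compress in one pass:
    tar cf - -C "$DST" "$year" | gzip -c > "$DST/year_${year}.tar.gz"
done
```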

Michael Steele_2
Honored Contributor

Re: Zip files 1.7 Million files

And I have provided the procedure that I would use if I had the task to perform.
Support Fatherhood - Stop Family Law
Dennis Handly
Acclaimed Contributor

Re: Zip files 1.7 Million files

>I have a filesystem which has got 1.7 million files. ... All files are residing on a single directory.

I assume people have told you this is not a good idea?

>The files can be identified by their time stamp

Encoded in their name, or in the ll(1) output?
I have a case where they are encoded in their name.

>there are no separate directories for each year.

If the names include the year, the first thing to do is to create a subdirectory and move all of a year into it.

If they don't include the year, you can make a simple script to do that:
last_year=""
ll -trog | while read F1 F2 F3 F4 F5 F6 F7; do
   case $F6 in
   200[5-8]) ;;
   *) continue;;
   esac
   if [ "$F6" != "$last_year" ]; then
      mkdir -p $F6
      last_year=$F6
   fi
   echo mv "$F7" $last_year
done

Once they are in a separate directory, use the tar-gzip pipeline Steven suggested.

If you don't want to include the directory name in the tarball, you can use -C:
tar cf - -C 2005 . |
gzip > year_2005.tar.gz
OldSchool
Honored Contributor

Re: Zip files 1.7 Million files

perhaps the biggest problem here is that questions to the OP get a restatement of the original question, without additional information, and specific questions go unanswered.

there are a variety of answers posted, some of which may be more appropriate than others, depending on the exact goal, which isn't clear here.

I will add that if ssheri has any control over the creation of these files, I'd encode the creation date in the filename somehow, as anything relying on the "timestamps" is not going to be a reliable method for determining when a file was created.
ssheri
Advisor

Re: Zip files 1.7 Million files

Thanks a lot..

Your suggestions match my requirement.
The files are not modified after their arrival in the filesystem. These files are saved to the filesystem by a scheduled job, and they are not modified by any user afterwards.

ssheri
Advisor

Re: Zip files 1.7 Million files

Hi,

I have checked up using the options which "oldschool" provided.
=======================================
1. created reference files

2. ran find . -xdev -type f -newer $HOME/first.ref -a ! -newer $HOME/last.ref -exec cp -p {} /yourname/2005/. \+

======================================

I have used cp instead of mv for testing. The test was only for 1 month of data. But I am getting an error when I execute it.

cp: ./filename: not a directory, where filename is the last file which was supposed to be copied as per the reference file.

For example if use
touch -a -m -t 200510010000.00 $HOME/first.ref
touch -a -m -t 200511302359.59 $HOME/last.ref

the above error comes up with the last file dated 20051130. I tried changing the date for touch, and I get the same error for the last file created on the date given in last.ref.



OldSchool
Honored Contributor

Re: Zip files 1.7 Million files

what happens with

find . -xdev -type f -newer $HOME/first.ref -a ! -newer $HOME/last.ref -exec ls -l {} \+
ssheri
Advisor

Re: Zip files 1.7 Million files

It works fine. I am getting the error with the last file in a particular month (as per the last.ref file) when I do a cp instead of ll.
Dennis Handly
Acclaimed Contributor

Re: Zip files 1.7 Million files

>2. find . -xdev -type f -newer $HOME/first.ref -a ! -newer $HOME/last.ref -exec cp -p {} /yourname/2005/. \+

(I would forget about cp instead of mv for a million files.)

>-exec cp -p {} /yourname/2005/. \+

You can't do this. You can only put the {} last. (Unless GNU find does it?)
(Also leave off the stinkin' \ before the "+".)

So you'll need to write a script:
-exec cp_stuff.sh /yourname/2005 +

Inside cp_stuff.sh you have:
#!/usr/bin/sh
# swap first to last to make "find -exec +" happy.
target=$1
shift
cp -p "$@" "$target"

(The quotes are to make Steven happy with his stinkin' spaces. :-)

>OldSchool: what happens with

Why fiddle with the find syntax when it is the -exec that is the problem?
Also this is what tusc is for.
OldSchool
Honored Contributor

Re: Zip files 1.7 Million files

"Why fiddle with the find syntax when it is the -exec that is the problem?
Also this is what tusc is for."

because it lost enough in translation that I couldn't tell that (from what the OP posted)...

Plus, based on the question, I doubt that the OP has the ability to interpret the output of tusc....