Operating System - HP-UX

Does anyone know of a way to scan a file and identify the language?

 
Rich Wright
Trusted Contributor

Does anyone know of a way to scan a file and identify the language?
17 REPLIES
Pete Randall
Outstanding Contributor

Re: Does anyone know of a way to scan a file and identify the language?

Compile binary? Source code? Text?


Pete
Pete Randall
Outstanding Contributor

Re: Does anyone know of a way to scan a file and identify the language?

Sorry:

Compiled binary? Source code? Text?


Pete
Rich Wright
Trusted Contributor

Re: Does anyone know of a way to scan a file and identify the language?

Sorry for not being clear.
I mean human languages: Chinese? Spanish? etc.
Pete Randall
Outstanding Contributor

Re: Does anyone know of a way to scan a file and identify the language?

Ahh! Plain text, then. Visually via cat or more or somesuch is the only way I can think of.


Pete
James R. Ferguson
Acclaimed Contributor
Solution

Re: Does anyone know of a way to scan a file and identify the language?

Hi:

# file filename

Regards!

...JRF...
Pete Randall
Outstanding Contributor

Re: Does anyone know of a way to scan a file and identify the language?

Thanks, James! Another new trick that I should have known.


Pete
Rich Wright
Trusted Contributor

Re: Does anyone know of a way to scan a file and identify the language?

I'm thinking of a program that would do some sort of dictionary match against a text file to identify its origin as Chinese, Spanish, French, German, etc.
Shannon Petry
Honored Contributor

Re: Does anyone know of a way to scan a file and identify the language?

Well, since source code can be so strange, I'd guess that it's not fully possible to do, but you can get close.

I think that the header line is critical for scripting, i.e.:
for FILE in /usr/local/scripts/* ; do
# take the first line starting with '#', normally the #! interpreter line
TEST=`grep '^#' "$FILE" | head -1`
case $TEST in
*ksh) SCR_LANG="korn" ;;
*csh) SCR_LANG="c-shell" ;;
*perl*) SCR_LANG="perl" ;;
*sh) SCR_LANG="bourne" ;;
*) SCR_LANG="I have no clue" ;;
esac
echo "$FILE: $SCR_LANG"
done

^^Order is critical: with *sh first, ksh and csh would be classified as Bourne.

Okay, so this part was easy, but now you get to compiled languages. My guess is that you're looking for source in C/C++/Fortran/Cobol/etc.?

Been a while since Cobol, so my best guess is to "grep table $FILE", as cobol does not use arrays like other languages. If there is a table defined, it's Cobol.

The rest is very, very difficult. You're better off going by extension. Why?

C++, C, and Fortran are very similar. Depending on the code, the same basic functions can look the same in C and C++, and it's similar with Fortran and Pascal.

This would leave looking at include statements for what language it is.

How many differences between #include and #define are there between C and C++? None.
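
Going by extension, as suggested above, might look something like the following minimal sketch. The function name guess_by_ext and the extension table are illustrative assumptions, not a complete or authoritative mapping, and a header file (.h) still cannot be pinned to C or C++ this way.

```shell
#!/bin/sh
# guess_by_ext: hypothetical helper that guesses a source language from the
# file name extension alone. The mapping below is an assumption for
# illustration; extend it to match local naming habits.
guess_by_ext() {
    case "$1" in
        *.c)                  echo "C" ;;
        *.C|*.cc|*.cpp|*.cxx) echo "C++" ;;
        *.f|*.for|*.f90)      echo "Fortran" ;;
        *.cbl|*.cob)          echo "Cobol" ;;
        *.p|*.pas)            echo "Pascal" ;;
        *.h)                  echo "C or C++ header" ;;
        *)                    echo "unknown" ;;
    esac
}

# Usage: classify every file named on the command line.
for f in "$@"; do
    printf '%s: %s\n' "$f" "$(guess_by_ext "$f")"
done
```

Note that case patterns are case-sensitive, so main.c and main.C land in different branches.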

Regards,
Shannon
Microsoft. When do you want a virus today?
Shannon Petry
Honored Contributor

Re: Does anyone know of a way to scan a file and identify the language?

Bah, that's what I get for trying to guess what you were looking for ;/

The thing is... you can't really do this. If you run "strings $FILENAME", it should return the embedded ASCII in files. However, courier is the same in any language, so you can't tell where it's from.


Shannon
Microsoft. When do you want a virus today?
Rich Wright
Trusted Contributor

Re: Does anyone know of a way to scan a file and identify the language?

Looks like "file" may do it.
From "man file"
...
file performs a series of tests on each file in an attempt to classify it. If file appears to be an ASCII file, file examines the first 512 bytes and tries to guess its language.
...
If by "language" they mean Spanish, German, etc. I would expect that all available Language sets SW would have to be installed.
If anyone has done this on HP-UX, I would like to know. I tried a brief test ("Uno, Dos, Tres" -or- "Eins, Swei, Drei") with no success. "file" only identified the text as "ascii text".
harry d brown jr
Honored Contributor

Re: Does anyone know of a way to scan a file and identify the language?

I'm sure the NSA can help ;-)

If it's Chinese, then it's more likely "unicoded".

Google.com translates pages.

What is the source of these files?

live free or die
harry
Live Free or Die
Rich Wright
Trusted Contributor

Re: Does anyone know of a way to scan a file and identify the language?

This is for a global manufacturing company that is looking into a data warehousing need.
harry d brown jr
Honored Contributor

Re: Does anyone know of a way to scan a file and identify the language?

This is why XML is important in B2B.

live free or die
harry
Live Free or Die
James R. Ferguson
Acclaimed Contributor

Re: Does anyone know of a way to scan a file and identify the language?

Hi (again) Rich:

No, 'file' attempts to identify the computer programming "language" as for instance:

# file mycode.c

...might report:

c program text

For what you want, you might 'grep' and count ('wc') words that would, with some reasonable degree of probability, be associated with a particular human language.
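
A rough sketch of that word-counting idea, using awk for the counting instead of a grep/wc pipeline. The function names (count_words, guess_language) and the short word lists are assumptions made for illustration; a real check would need much larger lists and more languages, and it only handles plain ASCII words.

```shell
#!/bin/sh
# count_words: print how many words on stdin appear in the space-separated
# word list passed as $1. Matching is case-insensitive and ASCII-only.
count_words() {
    awk -v list="$1" '
        BEGIN { n = split(list, w, " "); for (i = 1; i <= n; i++) want[w[i]] = 1 }
        {
            m = split(tolower($0), tok, /[^a-z]+/)
            for (i = 1; i <= m; i++) if (tok[i] in want) hits++
        }
        END { print hits + 0 }'
}

# guess_language: read text on stdin and print the language whose
# (illustrative, assumed) list of common words scores the most hits.
guess_language() {
    text=$(cat)
    es=$(printf '%s\n' "$text" | count_words "el la de que y los en un por con")
    de=$(printf '%s\n' "$text" | count_words "der die das und ist nicht ein ich mit zu")
    en=$(printf '%s\n' "$text" | count_words "the of and to in is that it was for")
    best=unknown
    max=0
    for pair in "spanish:$es" "german:$de" "english:$en"; do
        n=${pair#*:}
        if [ "$n" -gt "$max" ]; then
            max=$n
            best=${pair%%:*}
        fi
    done
    echo "$best"
}

# Usage: guess_language < somefile.txt
```

Very short samples such as "Uno, Dos, Tres" contain none of these common words and come back as "unknown", so this approach wants a reasonable amount of running text.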

Regards!

...JRF..
Pete Randall
Outstanding Contributor

Re: Does anyone know of a way to scan a file and identify the language?

So, basically, we're back to doing it visually?! Darn, I thought from the man page that file would do exactly what Rich wanted. The term "language" is extremely ambiguous in this environment.


Pete
A. Clay Stephenson
Acclaimed Contributor

Re: Does anyone know of a way to scan a file and identify the language?

There does seem to be a method. There is a Lingua::Ident Perl module that does exactly this. It compares input text to a statistical database and out pops the language. There is a command (trainlid) to build the database using sample text.

http://search.cpan.org/author/MPIOTR/Lingua-Ident-1.4/

If it ain't broke, I can fix that.
Volker Borowski
Honored Contributor

Re: Does anyone know of a way to scan a file and identify the language?

Cool question,

I found an old contest at

http://www.bwinf.de/ (sorry, in German)

(Contest 19, question 3)

Unfortunately, there are no source codes in the archive.

Basically, the given solution hints suggest scanning for special characters unique to a language, or for so-called character key sequences or keywords. To avoid the special-character problem inside this forum, here is the complete link; you might want to use only the table in the middle of the page. I did not try to "babelfish" this page into English, because I think the table itself gives you a clue.
http://www.bwinf.de/archiv/bwi19/runde1/l13main.html
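
A minimal sketch of the character-sequence idea, restricted to plain ASCII sequences so it sidesteps the special-character problem. The function name seq_score and the sequences used below are illustrative assumptions and are not taken from the table on the linked page.

```shell
#!/bin/sh
# seq_score: print the total number of occurrences, in the text on stdin, of
# every character sequence in the space-separated list $1 (case-insensitive).
# Each sequence is used as an awk regular expression, so keep to plain letters.
seq_score() {
    awk -v list="$1" '
        BEGIN { n = split(list, s, " ") }
        {
            line = tolower($0)
            # gsub returns the number of matches; replacing with "&" leaves
            # the line unchanged for the next sequence.
            for (i = 1; i <= n; i++) hits += gsub(s[i], "&", line)
        }
        END { print hits + 0 }'
}

# Usage: score one sample against two assumed sequence lists and compare.
sample="Die Schule schreibt eine Zeitung"
printf '%s\n' "$sample" | seq_score "sch ei ung"    # German-looking sequences
printf '%s\n' "$sample" | seq_score "cion que ll"   # Spanish-looking sequences
```

The list with the higher score is the better guess; as with keyword counting, longer samples give more trustworthy scores.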

Volker