SharpDevelop Community

decompressing a zip file with non-English file names

Last post 02-05-2010 5:26 AM by DavidPierson. 16 replies.
  • 01-31-2010 9:57 AM

    decompressing a zip file with non-English file names

     Hi,

    First of all thanks for the great library!

    When decompressing a zip file that contains files with English characters everything works just fine for me, but when I try to decompress a zip that contains characters from other languages (such as Hebrew) I'm getting gibberish.

    Does anyone have any idea what the cause could be? I assume it uses the wrong encoding to decode the characters of the file name, but I'm not sure how to solve it.

    Some more info:
    I'm using Windows 7 and I've created a zip file that contains a file with a Hebrew name. I used Windows to zip it, because when I use #ZipLib I get an empty zip file. Then I tried to unzip it using #ZipLib and got the correct file, but with a gibberish name...

    Thanks,

    Moshe

  • 01-31-2010 12:02 PM In reply to

    Re: decompressing a zip file with non-English file names

    Hi Moshe

    Filenames in Zip files rely on the codepage of the creating and extracting systems. Have a look at these references for starters ...

    http://blogs.msdn.com/michkap/archive/2005/05/10/416181.aspx

    http://community.sharpdevelop.net/forums/t/4546.aspx

    http://community.sharpdevelop.net/forums/t/9524.aspx

    Do those help?
    David
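
    To make the code page dependence concrete, here is a minimal sketch (not part of the library) that decodes the same raw name bytes with two different code pages and then recovers the intended text. The class and method names are made up for illustration, the three bytes are a hypothetical file name, and it assumes the .NET Framework, where Encoding.GetEncoding accepts OEM code page numbers directly.

```csharp
using System;
using System.Text;

static class NameMojibake
{
    // Assumes .NET Framework; on .NET Core/5+ the code-pages encoding
    // provider would have to be registered before calling GetEncoding(862).

    // Decode raw file name bytes using the given code page.
    public static string Decode(byte[] raw, int codePage)
    {
        return Encoding.GetEncoding(codePage).GetString(raw);
    }

    // Undo a wrong decode: turn the text back into bytes with the code page
    // that was wrongly assumed, then decode with the intended one.
    public static string Redecode(string wrong, int assumedCodePage, int actualCodePage)
    {
        byte[] raw = Encoding.GetEncoding(assumedCodePage).GetBytes(wrong);
        return Encoding.GetEncoding(actualCodePage).GetString(raw);
    }

    static void Main()
    {
        byte[] raw = { 0x80, 0x81, 0x82 };              // hypothetical name bytes
        string wrong = Decode(raw, 437);                // gibberish: "Çüé"
        string repaired = Redecode(wrong, 437, 862);    // Hebrew: "אבג"
        Console.WriteLine(wrong + " -> " + repaired);
    }
}
```

    The round trip through 437 is lossless because that code page assigns a character to every byte value, so a name that was decoded with the wrong page can still be repaired afterwards.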

  • 01-31-2010 9:13 PM In reply to

    Re: decompressing a zip file with non-English file names

     Hi David,

    Thanks for the quick reply. 

    I've read the interesting references you posted there but still have some questions. I understand that my problem can easily be solved if I know the encoding that the zip file was compressed with. The problem is that I don't always know what encoding was used to compress the files. Is there any way to decompress the zip file without knowing the encoding it was compressed with?

    Thank you,

    Moshe

     

  • 02-01-2010 12:04 AM In reply to

    Re: decompressing a zip file with non-English file names

    Hi Moshe

    I'll be honest: while I am very familiar with UTF, I know very little about Windows code pages in depth, and I am looking them up on wiki at the moment :-)

    I have looked at the documentation for this aspect of the Zip file layout, and I do not believe the zipfile contains any indicator of which code page was used to create the filenames.

    The assumption was that the same codepage that created it, would be used to unpack it.

    People on this forum who have been both creating and unpacking zips using SharpZip have modified the code to use UTF8, solving the problem. But when the zip is externally sourced, and you do not know what codepage it is, I think this must be difficult.

    I will continue to research but I thought I should give a quick answer  based on what I know already.

  • 02-01-2010 1:45 AM In reply to

    Re: decompressing a zip file with non-English file names

    This comment from the 7-Zip revision history is most interesting ...

    - Unicode (UTF-8) support for filenames in .ZIP archives. Now there are 3 modes:
    1) Default mode: 7-Zip uses UTF-8, if the local code page doesn't contain required symbols.
    2) -mcu switch: 7-Zip uses UTF-8, if there are non-ASCII symbols.
    3) -mcl switch: 7-Zip uses local code page.
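
    The default mode described above needs a test for "the local code page doesn't contain required symbols". A minimal sketch of such a test follows; it is my own illustration, not 7-Zip's actual code, and the names are hypothetical. It asks .NET for a strict encoder that throws instead of substituting characters.

```csharp
using System;
using System.Text;

static class NameEncodingChoice
{
    // Assumes .NET Framework; on .NET Core/5+ the code-pages encoding
    // provider would have to be registered first.

    // True if every character of the name can be represented in the code page.
    public static bool FitsCodePage(string name, int codePage)
    {
        Encoding strict = Encoding.GetEncoding(
            codePage,
            EncoderFallback.ExceptionFallback,   // throw instead of writing '?'
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(name);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(FitsCodePage("readme.txt", 437)); // True
        Console.WriteLine(FitsCodePage("א.txt", 437));       // False
        Console.WriteLine(FitsCodePage("א.txt", 862));       // True
    }
}
```

    A writer following mode 1 would fall back to UTF-8 only when this check fails for the local code page.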

    I would like to know:-

    a) When Windows creates a Zip file on a non-ASCII machine, what does it put in the filenames?

    b) What do the actual bytes for various local code pages look like? Are they UTF, or something else?

     

  • 02-01-2010 4:58 AM In reply to

    Re: decompressing a zip file with non-English file names

    Found something I'd missed previously. The Zip standard allows for UTF-8 encoding of filenames. This is signalled by setting bit 11 of the General Purpose Bit Flag, the Language encoding flag (EFS).

    What we need to know is whether the zipfiles you are receiving have this set. Have to leave work now, but this is a promising direction.
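
    One way to check that bit without touching the library is to read it straight from the file. This is only a sketch with made-up names: it looks at the first local file header and nothing else, so it assumes the archive starts with an entry (no self-extractor stub, not an empty zip). The flag is stored per entry, so other entries could differ.

```csharp
using System;
using System.IO;

static class ZipFlagCheck
{
    const uint LocalHeaderSignature = 0x04034b50; // bytes "PK" 03 04
    const int Utf8Flag = 1 << 11;                 // general purpose bit 11 (EFS)

    public static bool IsUtf8Flagged(int flags)
    {
        return (flags & Utf8Flag) != 0;
    }

    // Reads the general purpose bit flag of the first entry in the archive.
    public static int ReadFirstEntryFlags(Stream zip)
    {
        BinaryReader reader = new BinaryReader(zip); // little-endian, like ZIP
        if (reader.ReadUInt32() != LocalHeaderSignature)
            throw new InvalidDataException("No local file header at start of stream.");
        reader.ReadUInt16();                         // version needed to extract
        return reader.ReadUInt16();                  // general purpose bit flag
    }

    static void Main(string[] args)
    {
        using (FileStream fs = File.OpenRead(args[0]))
        {
            int flags = ReadFirstEntryFlags(fs);
            Console.WriteLine("flags=0x{0:X4}, UTF-8 names: {1}",
                flags, IsUtf8Flagged(flags));
        }
    }
}
```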

  • 02-01-2010 1:23 PM In reply to

    Re: decompressing a zip file with non-English file names

     Hi David.

    Your effort is very much appreciated :)

    Well, I've checked it and Windows definitely didn't use UTF-8 encoding for the file names - bit 11 is not set.

    I've looked for a code page that matches the bytes I got (128 for the first Hebrew letter Alef - א) and found that the code page windows used to encode the file names is 862. The DefaultCodePage property (ZipConstants) returns a different code page - 437.
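
    The search described above can be automated. A rough sketch, with hypothetical names, that asks the runtime for every encoding it knows and keeps the single-byte ones that decode byte 128 as Alef. The result depends on which encodings the machine and runtime expose, so no particular output is promised here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class CodePageSearch
{
    // Lists the single-byte code pages known to the runtime that decode
    // the given byte to the expected character.
    public static List<int> FindCodePages(byte value, char expected)
    {
        List<int> matches = new List<int>();
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            try
            {
                Encoding enc = info.GetEncoding();
                if (!enc.IsSingleByte) continue;
                string text = enc.GetString(new byte[] { value });
                if (text.Length == 1 && text[0] == expected)
                    matches.Add(info.CodePage);
            }
            catch (NotSupportedException)
            {
                // Encoding is listed but not usable on this machine: skip it.
            }
            catch (ArgumentException)
            {
                // Same as above for runtimes that report it this way.
            }
        }
        return matches;
    }

    static void Main()
    {
        foreach (int codePage in FindCodePages(0x80, 'א'))
            Console.WriteLine(codePage);
    }
}
```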

    I've tried to extract the zip file with WinRAR and it did it fine.. so I believe that either there's an indication within the zip file of the encoding, or the default code page property returns a wrong code page.. does it make any sense?

    Thanks,
    Moshe

  • 02-02-2010 7:25 AM In reply to

    Re: decompressing a zip file with non-English file names

    Hi Moshe

    That is really strange. I'll spell out my findings in detail in case I messed up, but to sum up, Windows 7 is creating Zips not using UTF, not using the current Hebrew codepage, but using a code page that has been obsolete since the days of Windows 3.1.

    What I found from wiki etc:-

    • Code page 862 is a very old DOS Hebrew page. It was replaced by Windows-1255 in Windows 3.x and 9x systems and is now obsolete.
    • It is the only Hebrew codepage with Alef - א in position 128.
    • Windows-1255 is the current Hebrew codepage. It has the Alef - א in position 0xE0 (224).
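
    The two positions listed above can be checked by encoding the letter with each code page. A small sketch with made-up names, assuming the .NET Framework:

```csharp
using System;
using System.Text;

static class AlefBytes
{
    // Assumes .NET Framework; on .NET Core/5+ the code-pages encoding
    // provider would have to be registered first.

    // Returns the single byte that the code page uses for the character.
    public static byte ByteFor(char c, int codePage)
    {
        byte[] bytes = Encoding.GetEncoding(codePage).GetBytes(new char[] { c });
        return bytes[0];
    }

    static void Main()
    {
        Console.WriteLine("CP862:        0x{0:X2}", ByteFor('א', 862));   // 0x80
        Console.WriteLine("Windows-1255: 0x{0:X2}", ByteFor('א', 1255));  // 0xE0
    }
}
```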

    When I read the MS blogs about multilingual support, they make out as if they are up-to-date and cutting edge. But the reality seems to be very different.

  • 02-02-2010 7:32 AM In reply to

    Re: decompressing a zip file with non-English file names

    sapihes:
    The DefaultCodePage property (ZipConstants) returns a different code page - 437

    That returns the value of "Thread.CurrentThread.CurrentCulture.TextInfo.OEMCodePage" which for me is 850. Again, this is obsolete, and should be 1252 by my understanding. Okay, ANSICodePage returns 1252.  What does ANSICodePage return for you?

    But ... despite all this, somehow, WinRAR manages to get it right. If it wasn't for that, I would throw my hands in the air in despair. But if they can figure it out, so can we.

    Could you try it with 7-Zip ? I would love to know if that does it correctly. It is open source.

    Thanks,
    Dave

    P.S.

    sapihes:
    The DefaultCodePage property (ZipConstants) returns a different code page - 437
    - was that on a Hebrew version of Windows?

  • 02-02-2010 7:01 PM In reply to

    Re: decompressing a zip file with non-English file names

    Hi Dave,

    DavidPierson:
    That returns the value of "Thread.CurrentThread.CurrentCulture.TextInfo.OEMCodePage" which for me is 850. Again, this is obsolete, and should be 1252 by my understanding. Okay, ANSICodePage returns 1252.  What does ANSICodePage return for you?


    The ANSICodePage returns 1252 for me too.

    DavidPierson:
    But ... despite all this, somehow, WinRAR manages to get it right. If it wasn't for that, I must throw my hands in the air in despair. But if they can figure it out, so can we.
     

    Well, exactly! That's what I was thinking, and that's why I was wondering if the zip format contains the code page it was saved in.

    DavidPierson:
    was that on a Hebrew version of Windows?
     

    Nope, it's an English version of Windows, where the default language for non-Unicode programs is Hebrew (I guess that's why Encoding.Default returns 1255 for me, which is the code page I would expect Windows to encode the file names with)
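
    The numbers in this exchange come from the culture's TextInfo, which carries an OEM and an ANSI code page as a pair. A minimal sketch (hypothetical names) that prints the pair for a few cultures. On Windows I would expect en-US to report OEM 437 and he-IL to report OEM 862 with ANSI 1255, in line with the values quoted in this thread, but that is an expectation rather than something verified on every runtime.

```csharp
using System;
using System.Globalization;

static class CultureCodePages
{
    // Formats the OEM and ANSI code pages that a culture reports.
    public static string Describe(string cultureName)
    {
        TextInfo info = CultureInfo.GetCultureInfo(cultureName).TextInfo;
        return string.Format("{0}: OEM={1}, ANSI={2}",
            cultureName, info.OEMCodePage, info.ANSICodePage);
    }

    static void Main()
    {
        Console.WriteLine(Describe("en-US"));
        Console.WriteLine(Describe("he-IL"));
        // The culture of the current thread, which is what the quoted
        // Thread.CurrentThread.CurrentCulture expression reads.
        Console.WriteLine(Describe(CultureInfo.CurrentCulture.Name));
    }
}
```

    If the thread culture is English while the language for non-Unicode programs is Hebrew, a value read from the thread culture would not match the Hebrew code pages, which is consistent with the 437 reported earlier.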

  • 02-02-2010 7:13 PM In reply to

    Re: decompressing a zip file with non-English file names

    Hi,

    I've tried to extract the zip file using 7-Zip and it manages to do it well - the file names are correct...

  • 02-03-2010 1:16 AM In reply to

    Re: decompressing a zip file with non-English file names

    Okay. The standard codepage for Zip filename entries is IBM437. Or, you can decide to use a different codepage when creating and extracting. And there's that UTF bit, which wasn't set in your sample files.

    But there's a third possibility ... the 0x0008 Extra Field. Full info is in the PKWARE doco in "APPENDIX D - Language Encoding (EFS)".
    Key quote ...

    The 0x0008 Extra Field storage may be used with either setting for general purpose bit 11.  Examples of the intended usage for this field is to store whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC.  Similarly, other commonly used character encoding (code page) designations can be indicated through this field.  Formalized values for use of the 0x0008 record remain undefined at this time.  The definition for the layout of the 0x0008 field will be published when available.  Use of the 0x0008 Extra Field provides for storing data within a ZIP file in an encoding other than IBM Code Page 437 or UTF-8.

    Our code does not look for this field. Can you try this? In the ZipEntry.ProcessExtraData method, add
    if (extraData.Find(0x0008)) {
       // do something
    }

    and see if anything comes up?
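
    For anyone who wants to inspect the extra data outside the library, here is a standalone sketch of the same lookup. It is not the library's implementation and the names are made up; it relies only on the record layout from the Zip specification (2-byte header ID, 2-byte data size, then the data, all little-endian). The sample bytes are hypothetical.

```csharp
using System;

static class ExtraFieldScan
{
    // Returns the offset of the data of the record with the given header ID,
    // or -1 if no such record is present.
    public static int Find(byte[] extra, ushort headerId)
    {
        int pos = 0;
        while (extra != null && pos + 4 <= extra.Length)
        {
            ushort id = (ushort)(extra[pos] | (extra[pos + 1] << 8));
            int size = extra[pos + 2] | (extra[pos + 3] << 8);
            if (pos + 4 + size > extra.Length) break;   // truncated record
            if (id == headerId) return pos + 4;
            pos += 4 + size;
        }
        return -1;
    }

    static void Main()
    {
        // Hypothetical extra data: a 0x000A record with 2 data bytes,
        // followed by a 0x0008 record with 1 data byte.
        byte[] extra =
        {
            0x0A, 0x00, 0x02, 0x00, 0xAA, 0xBB,
            0x08, 0x00, 0x01, 0x00, 0x01
        };
        Console.WriteLine(Find(extra, 0x0008));  // 10
        Console.WriteLine(Find(extra, 0x7075));  // -1
    }
}
```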

  • 02-03-2010 1:20 AM In reply to

    Re: decompressing a zip file with non-English file names

    I'm not real confident that we will find that Extra Data field, so here's something else to try ...

    Set
    ZipConstants.DefaultCodePage = 1255;

    before extracting, and see if it extracts correctly.

     

  • 02-03-2010 10:02 AM In reply to

    Re: decompressing a zip file with non-English file names

    The extra data is empty in my case..
    I've already tried extracting using the 1255 code page but it didn't work :(

    There must be another way of getting the code page.. I hope..
    If I find something I will let you know.

  • 02-03-2010 11:29 PM In reply to

    Re: decompressing a zip file with non-English file names

    Did you try extracting with

    ZipConstants.DefaultCodePage = 862;

    That should work. If it does, then the assumption is that 7Zip/WinRAR are applying the same codepage. But how do they work out what page? Nowhere can I see that it is recorded in the zipfile itself.

    Well, 7Zip is open source and we could inspect it I suppose - I'd expect it is in C++.
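
    One possible explanation, which is an assumption on my part and not confirmed anywhere in this thread, is that those tools ask Windows for the OEM code page of the system locale rather than reading anything from the zipfile. A Windows-only sketch of how that value could be read from .NET through the Win32 functions GetOEMCP and GetACP; the wrapper class is hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

static class SystemOemCodePage
{
    // Win32: OEM code page of the system locale.
    [DllImport("kernel32.dll")]
    static extern uint GetOEMCP();

    // Win32: ANSI code page of the system locale.
    [DllImport("kernel32.dll")]
    static extern uint GetACP();

    public static int Oem { get { return (int)GetOEMCP(); } }
    public static int Ansi { get { return (int)GetACP(); } }

    static void Main()
    {
        Console.WriteLine("System OEM code page:  " + Oem);
        Console.WriteLine("System ANSI code page: " + Ansi);
        // With the library one could then try, as suggested in this thread:
        // ZipConstants.DefaultCodePage = SystemOemCodePage.Oem;
    }
}
```

    If this prints 862 on the machine that created the zip, that would at least be consistent with the bytes found in the file names.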
