UnicodeDecodeError in Python


Conversion errors

When converting between strings and bytes, it is very important to know exactly which encoding is being used, as well as what the different encodings can and cannot represent.

For example, the ASCII encoding cannot convert Cyrillic characters to bytes:

In [32]: hi_unicode = 'привет'

In [33]: hi_unicode.encode('ascii')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-33-ec69c9fd2dae> in <module>()
----> 1 hi_unicode.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

Similarly, if the string 'привет' has been converted to bytes and you try to convert it back to a string using ascii, you also get an error:

In [34]: hi_unicode = 'привет'

In [35]: hi_bytes = hi_unicode.encode('utf-8')

In [36]: hi_bytes.decode('ascii')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-36-aa0ada5e44e9> in <module>()
----> 1 hi_bytes.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

Another variant of the error occurs when different encodings are used for the two conversions:

In [37]: de_hi_unicode = 'grüezi'

In [38]: utf_16 = de_hi_unicode.encode('utf-16')

In [39]: utf_16.decode('utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-39-4b4c731e69e4> in <module>()
----> 1 utf_16.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Getting an error here is actually a good thing: it tells you explicitly what the problem is. It is worse when this happens:

In [40]: hi_unicode = 'привет'

In [41]: hi_bytes = hi_unicode.encode('utf-8')

In [42]: hi_bytes
Out[42]: b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

In [43]: hi_bytes.decode('utf-16')
Out[43]: '뿐胑룐닐뗐苑'
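
No exception is raised here because the UTF-8 bytes happen to form valid UTF-16 code units, so Python silently produces mojibake. A quick sanity check (a sketch, not part of the original examples) is to round-trip through the encoding you believe was used:

hi_bytes = 'привет'.encode('utf-8')

# The matching encoding restores the original text.
assert hi_bytes.decode('utf-8') == 'привет'

# A mismatched encoding may raise an exception -- or, as above, silently
# return garbage, which is why the round-trip check is worth doing early.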

Error handling

The encode and decode methods have error-handling modes that specify how to react to a conversion error.

The errors parameter of encode

By default, encode uses the strict mode: when an encoding error occurs, a UnicodeEncodeError exception (a subclass of UnicodeError) is raised. The examples above show this behaviour.

Instead of this mode, you can use replace to substitute the character with a question mark:

In [44]: de_hi_unicode = 'grüezi'

In [45]: de_hi_unicode.encode('ascii', 'replace')
Out[45]: b'gr?ezi'

Or namereplace, to substitute the character with its Unicode name:

In [46]: de_hi_unicode = 'grüezi'

In [47]: de_hi_unicode.encode('ascii', 'namereplace')
Out[47]: b'gr\\N{LATIN SMALL LETTER U WITH DIAERESIS}ezi'

You can also completely ignore characters that cannot be encoded:

In [48]: de_hi_unicode = 'grüezi'

In [49]: de_hi_unicode.encode('ascii', 'ignore')
Out[49]: b'grezi'
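
To compare the handlers side by side, here is a short sketch; it also tries backslashreplace, which is not covered above but is part of the standard set of error handlers:

s = 'grüezi'

for handler in ('replace', 'namereplace', 'ignore', 'backslashreplace'):
    # Each handler decides what to do with the unencodable 'ü'.
    print(handler, s.encode('ascii', handler))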

The errors parameter of decode

The decode method also uses strict mode by default and raises a UnicodeDecodeError exception.

If you change the mode to ignore, as in encode, the offending bytes are simply skipped:

In [50]: de_hi_unicode = 'grüezi'

In [51]: de_hi_utf8 = de_hi_unicode.encode('utf-8')

In [52]: de_hi_utf8
Out[52]: b'gr\xc3\xbcezi'

In [53]: de_hi_utf8.decode('ascii', 'ignore')
Out[53]: 'grezi'

The replace mode substitutes a replacement character for them:

In [54]: de_hi_unicode = 'grüezi'

In [55]: de_hi_utf8 = de_hi_unicode.encode('utf-8')

In [56]: de_hi_utf8.decode('ascii', 'replace')
Out[56]: 'gr��ezi'
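
Each undecodable byte becomes U+FFFD, the Unicode replacement character, so the two bytes of the UTF-8 'ü' turn into two placeholders. A quick check (assuming the same bytes as above):

decoded = b'gr\xc3\xbcezi'.decode('ascii', 'replace')
assert decoded == 'gr\ufffd\ufffdezi'  # one U+FFFD per rejected byte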

We can’t guess what you are trying to do, nor what’s in your code, nor what "setting many different codecs" means, nor what u"string" is supposed to do for you.

Please change your code to its initial state so that it reflects as best you can what you are trying to do, run it again, and then edit your question to provide (1) the full traceback and error message that you get, (2) a snippet encompassing the last statement in your script that appears in the traceback, (3) a brief description of what you want the code to do, and (4) the version of Python you are running.

Edit after details added to question:

(0) Let’s try some transformations on the failing statement:

Original:
print "Error reading file %s"%u"%s/%s"%(folder, f)
Add spaces for reduced illegibility:
print "Error reading file %s" % u"%s/%s" % (folder, f)
Add parentheses to emphasise evaluation order:
print ("Error reading file %s" % u"%s/%s") % (folder, f)
Evaluate the (constant) expression in parentheses:
print u"Error reading file %s/%s" % (folder, f)

Is that really what you intended? Suggestion: construct the path ONCE, using a better method (see point (2) below).

(1) In general, use repr(foo) or "%r" % foo for diagnostics. That way, your diagnostic code is much less likely to cause an exception (as is happening here) AND you avoid ambiguity. Insert the statement print repr(folder), repr(f) before you try to get the size, rerun, and report back.

(2) Don’t make paths by u"%s/%s" % (folder, filename) … use os.path.join(folder, filename)

(3) Don’t have bare excepts, check for known problems. So that unknown problems don’t remain unknown, do something like this:

try:
    some_code()
except ReasonForBaleOutError:
    continue
except: 
    # something's gone wrong, so get diagnostic info
    print repr(interesting_datum_1), repr(interesting_datum_2)
    # ... and get traceback and error message
    raise

A more sophisticated way would involve logging instead of printing, but the above is much better than not knowing what’s going on.
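
As a sketch of that logging variant (reusing the hypothetical names from the snippet above; logging.exception records the traceback for you):

import logging

logging.basicConfig(level=logging.DEBUG)

try:
    some_code()
except ReasonForBaleOutError:
    pass  # 'continue' in the original loop; skip this known failure
except:
    # One call records the diagnostic repr()s AND the full traceback ...
    logging.exception('failed: %r %r', interesting_datum_1, interesting_datum_2)
    raise  # ... and re-raising keeps the failure visible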

Further edits after rtm("os.walk"), remembering old legends, and re-reading your code:

(4) os.walk() walks over the whole tree; you don’t need to call it recursively.

(5) If you pass a unicode string to os.walk(), the results (paths, filenames) are reported as unicode. You don’t need all that u"blah" stuff. Then you just have to choose how you display the unicode results.

(6) Removing paths with "$" in them: You must modify the list in situ but your method is dangerous. Try something like this:

for i in xrange(len(folders) - 1, -1, -1):
    if '$' in folders[i]:
        del folders[i]
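
An equivalent and arguably safer idiom (my suggestion, not part of the original answer) is a slice assignment, which also mutates the list in place as os.walk() requires:

folders[:] = [d for d in folders if '$' not in d]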

(7) You refer to files by joining a folder name and a file name. You are using the ORIGINAL folder name; when you rip out the recursion, this won’t work; you’ll need to use the currently-discarded content[0] value reported by os.walk.

(8) You should find yourself using something very simple like:

for folder, subfolders, filenames in os.walk(unicoded_top_folder):

There’s no need for generator = os.walk(...); try: content = generator.next() etc and BTW if you ever need to do generator.next() in the future, use except StopIteration instead of a bare except.

(9) If the caller provides a non-existent folder, no exception is raised, it just does nothing. If the provided folder exists but is empty, ditto. If you need to distinguish between these two scenarios, you’ll need to do extra testing yourself.

Response to this comment from the OP: """Thanks, please read the info repr() has shown in the first post. I don’t know why it printed so many different items, but it looks like they all have problems. And the common thing between all of them is they are .ink files. May that be the problem? Also, in the last ones, the firefox ones, it prints ( Modalitrovvisoria) while the real file name from Explorer contains ( Modalità provvisoria)"""

(10) Umm that’s not ".INK".lower(), it’s ".LNK".lower() … perhaps you need to change the font in whatever you’re reading that with.

(11) The fact that the "problem" file names all end in ".lnk" /may/ be something to do with os.walk() and/or Windows doing something special with the names of those files.

(12) I repeat here the Python statement that you used to produce that output, with some whitespace introduced:

print repr(
    "Error reading file %s" \
    % u"%s/%s" % (
        folder.decode('utf-8','ignore'),
        f.decode('utf-8','ignore')
        )
    )

It seems that you have not read, or not understood, or just ignored, the advice I gave you in a comment on another answer (and that answerer’s reply): UTF-8 is NOT relevant in the context of file names in a Windows file system.

We are interested in exactly what folder and f refer to. You have trampled all over the evidence by attempting to decode it using UTF-8. You have compounded the obfuscation by using the "ignore" option. Had you used the "replace" option, you would have seen "( Modalit\ufffdrovvisoria)". The "ignore" option has no place in debugging.

In any case, the fact that some of the file names had some kind of error but appeared NOT to lose characters with the «ignore» option (or appeared NOT to be mangled) is suspicious.

Which part of """Insert the statement print repr(folder), repr(f)""" did you not understand? All that you need to do is something like this:

print "Some meaningful text" # "error reading file" isn't
print "folder:", repr(folder)
print "f:", repr(f)

(13) It also appears that you have introduced UTF-8 elsewhere in your code, judging by the traceback: self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)

I would like to point out that you still do not know whether folder and f refer to str objects or unicode objects, and two answers have suggested that they are very likely to be str objects, so why introduce blahbah.encode() ??

A more general point: Try to understand what your problem(s) is/are, BEFORE changing your script. Thrashing about trying every suggestion coupled with near-zero effective debugging technique is not the way forward.

(14) When you run your script again, you might like to reduce the volume of the output by running it over some subset of C:\ … especially if you proceed with my original suggestion to have debug printing of ALL file names, not just the erroneous ones (knowing what non-error ones look like could help in understanding the problem).

Response to Bryan McLemore’s "clean up" function:

(15) Here is an annotated interactive session that illustrates what actually happens with os.walk() and non-ASCII file names:

C:\junk\terabytest>dir
[snip]
 Directory of C:\junk\terabytest

20/11/2009  01:28 PM    <DIR>          .
20/11/2009  01:28 PM    <DIR>          ..
20/11/2009  11:48 AM    <DIR>          empty
20/11/2009  01:26 PM                11 Hašek.txt
20/11/2009  01:31 PM             1,419 tbyte1.py
29/12/2007  09:33 AM                 9 Ð.txt
               3 File(s)          1,439 bytes
[snip]

C:\junk\terabytest>\python26\python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pprint import pprint as pp
>>> import os

os.walk(unicode_string) -> results in unicode objects

>>> pp(list(os.walk(ur"c:\junk\terabytest")))
[(u'c:\\junk\\terabytest',
  [u'empty'],
  [u'Ha\u0161ek.txt', u'tbyte1.py', u'\xd0.txt']),
 (u'c:\\junk\\terabytest\\empty', [], [])]

os.walk(str_string) -> results in str objects

>>> pp(list(os.walk(r"c:\junk\terabytest")))
[('c:\\junk\\terabytest',
  ['empty'],
  ['Ha\x9aek.txt', 'tbyte1.py', '\xd0.txt']),
 ('c:\\junk\\terabytest\\empty', [], [])]

cp1252 is the encoding I’d expect to be used on my system …

>>> u'\u0161'.encode('cp1252')
'\x9a'
>>> 'Ha\x9aek'.decode('cp1252')
u'Ha\u0161ek'

decoding the str with UTF-8 doesn’t work, as expected

>>> 'Ha\x9aek'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 2: unexpected code byte

ANY random string of bytes can be decoded without error using latin1

>>> 'Ha\x9aek'.decode('latin1')
u'Ha\x9aek'

BUT U+009A is a control character (SINGLE CHARACTER INTRODUCER), i.e. meaningless gibberish; absolutely nothing to do with the correct answer

>>> import unicodedata
>>> unicodedata.name(u'\u0161')
'LATIN SMALL LETTER S WITH CARON'
>>>

(16) That example shows what happens when the character is representable in the default character set; what happens if it’s not? Here’s an example (using IDLE this time) of a file name containing CJK ideographs, which definitely aren’t representable in my default character set:

IDLE 2.6.4      
>>> import os
>>> from pprint import pprint as pp

repr(Unicode results) looks fine

>>> pp(list(os.walk(ur"c:\junk\terabytest\chinese")))
[(u'c:\\junk\\terabytest\\chinese', [], [u'nihao\u4f60\u597d.txt'])]

and the unicode displays just fine in IDLE:

>>> print list(os.walk(ur"c:\junk\terabytest\chinese"))[0][2][0]
nihao你好.txt

The str result is evidently produced by using .encode(whatever, "replace"); not very useful, e.g. you can’t open the file by passing that as the file name.

>>> pp(list(os.walk(r"c:\junk\terabytest\chinese")))
[('c:\\junk\\terabytest\\chinese', [], ['nihao??.txt'])]

So the conclusion is that for best results, one should pass a unicode string to os.walk(), and deal with any display problems.
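
A minimal Python 2 sketch of that conclusion (the directory and the console codepage are assumptions; substitute your own): walk with a unicode top directory and encode only at the display boundary, keeping the real names intact for file operations.

import os

for folder, subfolders, filenames in os.walk(u'c:\\junk\\terabytest'):
    for name in filenames:
        path = os.path.join(folder, name)      # still unicode, safe for open() etc.
        print path.encode('cp850', 'replace')  # lossy encoding for display only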

Why is the below item failing? Why does it succeed with the "latin-1" codec?

o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")

Which results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte

asked Apr 5, 2011 by RuiDC

I had the same error when I tried to open a CSV file by pandas.read_csv
method.

The solution was to change the encoding to latin-1:

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols , encoding='latin-1')

answered Jul 18, 2015 by Mazen Aly

In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'

But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:

>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'

(Note, I’m using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)

answered Apr 5, 2011 by Josh Lee

It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.

If you don’t know the codeset you’re receiving strings in, you’re in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) were chosen for your protocol/application, and then you’d just reject input that didn’t decode.

If you can’t do that, you’ll need heuristics.
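
A small sketch of that reject-on-decode-failure policy (the function name is mine, for illustration):

def accept_utf8(raw_bytes):
    # Strict decoding: anything that is not valid UTF-8 is rejected outright.
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        raise ValueError('input is not valid UTF-8')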

answered Apr 5, 2011 by Sami J. Lehtinen

Because UTF-8 is a multibyte encoding and there is no character corresponding to your combination of \xe9 plus the following space.

Why should it succeed in both utf-8 and latin-1?

Here is how the same sentence should look in utf-8:

>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'

answered Apr 5, 2011 by neurino

Use this if pandas reports a UTF-8 decoding error:

pd.read_csv('File_name.csv',encoding='latin-1')

answered Apr 14, 2020 by Anshul Singh Suryan

If this error arises when manipulating a file that was just opened, check whether you opened it in 'rb' mode.

answered Jul 4, 2018 by Patrick Mutuku

A UTF-8 codec error usually appears when byte values fall outside the range 0 to 127.

The reasons this exception is raised:

1) If the code point is < 128, each byte is the same as the value of the code point.
2) If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

To overcome this there is a set of encodings, the most widely used being "Latin-1, also known as ISO-8859-1".

In ISO-8859-1, Unicode code points 0–255 are identical to the Latin-1 byte values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can’t be encoded into Latin-1.
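
A quick demonstration of both properties (a sketch; the sample character is arbitrary):

# Every byte value 0-255 decodes under latin-1, one byte per code point.
assert bytes(range(256)).decode('latin-1') == ''.join(chr(i) for i in range(256))

# A code point above 255 cannot be encoded to latin-1.
try:
    '\u0161'.encode('latin-1')  # U+0161 LATIN SMALL LETTER S WITH CARON
except UnicodeEncodeError as exc:
    print(exc)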

When this exception occurs while you are trying to load a data set, try this format:

df=pd.read_csv("top50.csv",encoding='ISO-8859-1')

Adding the encoding argument at the end of the call then allows the data set to load.

answered Jan 18, 2020 by surya

Well, this type of error comes up when you are reading a particular file or data set into pandas, such as:

data = pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv')

Then the error is displayed like this:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte

This type of error can be avoided by adding an encoding argument:

data=pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv', encoding='ISO-8859-1')

answered Jun 26, 2020 by Aditya Aggarwal

This happened to me as well, while I was reading text containing Hebrew from a .txt file.

I clicked File -> Save As and saved the file with UTF-8 encoding.

answered Feb 21, 2019 by Alon Gouldman

TLDR: I would recommend investigating the source of the problem in depth before switching encoders to silence the error.

I got this error as I was processing a large number of zip files with additional zip files in them.

My workflow was the following:

  1. Read zip
  2. Read child zip
  3. Read text from child zip

At some point I was hitting the encoding error above. Upon closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text led to some funky character representation that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data, it was not completely foolish to assume it was an encoding problem (I had problems with 0xc2: Â), but in the end it was not the actual issue.
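
A cheap guard for this situation (my sketch; the variable and helper names are illustrative) is to check the ZIP magic number before treating member bytes as text:

member_bytes = inner_zip.read(member_name)  # hypothetical zipfile.ZipFile access

# Every ZIP archive starts with the local-file-header signature PK\x03\x04,
# so bytes beginning this way are another archive, not text to decode.
if member_bytes[:4] == b'PK\x03\x04':
    handle_nested_zip(member_bytes)  # placeholder for the recursive step
else:
    text = member_bytes.decode('utf-8')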

answered Apr 17, 2022 by malvoisen

In this case, I tried to execute a .py script which ran a path/file.sql.

My solution was to change the encoding of the file.sql to "UTF-8 without BOM" and it worked!

You can do it with Notepad++.

I will leave part of my code.

import sys
import psycopg2

con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2], dbname=sys.argv[3],
                       user=sys.argv[4], password=sys.argv[5])

cursor = con.cursor()
sqlfile = open(path, 'r')

answered Jun 19, 2019 by Martin Taco

I encountered this problem, and it turned out that I had saved my CSV directly from a Google Sheets file. In other words, while in the Google Sheets file I chose "Save a copy", and when my browser downloaded it I chose "Open" and then saved the CSV directly. This was the wrong move.

What fixed it for me was first saving the sheet as an .xlsx file on my local computer, and from there exporting the single sheet as .csv. Then the error went away for pd.read_csv('myfile.csv').

answered Sep 26, 2022 by Nesha25

The solution was to change the encoding to "UTF-8 without BOM".

answered Jun 2, 2021 by masilva70

UnicodeDecodeError errors are quite common when working with data in Python, especially when reading files with the pandas library. They occur because the file contains bytes that cannot be decoded with the chosen encoding.

Here is a typical example of such an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid continuation byte

This means that Python is trying to read the file using the UTF-8 encoding, but encounters a byte (in this case, one with the value 0xda) that is not valid in that encoding.

The simplest way to solve this problem is to change the encoding used when reading the file. For example, you can use the 'latin1' or 'ISO-8859-1' encoding, which are more permissive and can read more byte values than UTF-8.

This is done via the encoding parameter of the read_csv function:

data = pd.read_csv(filepath, names=fields, encoding='latin1')

If you do not know which encoding the file uses, you can turn to the chardet library, which can detect the encoding automatically:

import chardet

with open(filepath, 'rb') as f:
    result = chardet.detect(f.read())

data = pd.read_csv(filepath, names=fields, encoding=result['encoding'])

Note, however, that using a permissive encoding may cause some characters to be read incorrectly. If the exact representation of every character matters for further processing, you may have to handle those characters manually or contact the data source to clarify which encoding is used.
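
One way to combine these approaches (a sketch; the candidate list is an assumption to adapt to your data) is to try strict UTF-8 first and only then fall back to more permissive encodings:

import pandas as pd

def read_csv_with_fallback(filepath, encodings=('utf-8', 'cp1251', 'latin1'), **kwargs):
    # Try strict encodings first; latin1 goes last because it never fails,
    # even when the decoded text is wrong.
    for enc in encodings:
        try:
            return pd.read_csv(filepath, encoding=enc, **kwargs)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings worked for %s' % filepath)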

PEP: Python3 and UnicodeDecodeError

This is a PEP describing the behaviour of Python3 on UnicodeDecodeError. It’s a draft, so don’t hesitate to comment on it. This document assumes that my patch to allow bytes filenames is accepted, which is not the case today.

While I was writing this document I found potential problems in Python3. So here is a TODO list (things to be checked):

  • FIXME: when is bytearray accepted, and when not?
  • FIXME: allow a bytes/str mix for shutil.copy*()? Will the ignore callback get bytes or unicode?

Can anyone write a section about encoding bytes into Unicode using escape sequences?

What is the best tool to work on a PEP? I hate email threads, and I would prefer SVN / Mercurial / anything else.


Python3 and UnicodeDecodeError for the command line, environment variables and filenames

Introduction

Python3 does its best to give you text as valid Unicode character strings. When it hits an invalid byte sequence (according to the charset in use), it has two choices: drop the value or raise a UnicodeDecodeError. This document presents the behaviour of Python3 for the command line, environment variables and filenames.

Example of an invalid bytes sequence: ::

>>> str(b'\xff', 'utf8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff (...)

whereas the same byte sequence is valid in another charset like ISO-8859-1: ::

>>> str(b'\xff', 'iso-8859-1')
'ÿ'

Default encoding

Python uses "UTF-8" as the default Unicode encoding. You can read the default charset using sys.getdefaultencoding(). The "default encoding" is used by PyUnicode_FromStringAndSize().

A function sys.setdefaultencoding() exists, but it raises a ValueError for any charset other than UTF-8, since the charset is hardcoded in PyUnicode_FromStringAndSize().

Command line

Python creates a nice unicode list for sys.argv using mbstowcs(): ::

$ ./python -c 'import sys; print(sys.argv)' 'Ho hé !'
['-c', 'Ho hé !']

On Linux, mbstowcs() uses the LC_CTYPE environment variable to choose the encoding. On an invalid byte sequence, Python quits directly with exit code 1. Example with a UTF-8 locale:

$ python3.0 $(echo -e 'invalid:\xff')
Could not convert argument 1 to string

Environment variables

Python uses "_wenviron" on Windows, which contains unicode (UTF-16-LE) strings. On other OSes, it uses the "environ" variable and the UTF-8 charset. It drops a variable if its key or value is not convertible to unicode. Example:

env -i HOME=/home/my PATH=$(echo -e "\xff") python
>>> import os; list(os.environ.items())
[('HOME', '/home/my')]

Both keys and values are unicode strings. An empty key and/or value is allowed.

Python ignores invalid variables, but their values still exist in memory. If you run a child process (eg. using os.system()), the "invalid" variables will also be copied.

Filenames

Introduction

Python2 uses byte filenames everywhere, but it was also possible to use unicode filenames. Examples:

  • os.getcwd() gives bytes whereas os.getcwdu() always returns unicode
  • os.listdir(unicode) creates bytes or unicode filenames (fallback to bytes on UnicodeDecodeError), os.readlink() has the same behaviour

  • glob.glob() converts the unicode pattern to bytes, and so create bytes filenames
  • open() supports bytes and unicode

Since listdir() mixes bytes and unicode, you cannot easily manipulate filenames:

>>> path=u'.'
>>> for name in os.listdir(path):
...  print repr(name)
...  print repr(os.path.join(path, name))
...
u'valid'
u'./valid'
'invalid\xff'
Traceback (most recent call last):
...
File "/usr/lib/python2.5/posixpath.py", line 65, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff (...)

Python3 supports both types, bytes and unicode, but disallows mixing them. If you ask for unicode, you will always get unicode, or an exception is raised.

You should only use unicode filenames, unless you are writing a program that fixes file system encodings, a backup tool, or your users are unable to fix their broken system.

Windows

Microsoft Windows since Windows 95 only uses Unicode (UTF-16-LE) filenames. So you should only use unicode filenames.

Non Windows (POSIX)

POSIX OSes like Linux use bytes for historical reasons. In the best case, all filenames will be encoded as valid UTF-8 strings and Python creates valid unicode strings. But since system calls use bytes, the file system may return an invalid filename, or a program can create a file with an invalid filename.

An invalid filename is a string which can not be decoded to unicode using the default file system encoding (which is UTF-8 most of the time).

A robust program will have to use only the bytes type to make sure that it can open / copy / remove any file or directory.

Filename encoding

Python uses:

  • "mbcs" on Windows
  • or "utf-8" on Mac OS X
  • or nl_langinfo(CODESET) on OSes supporting this function
  • or UTF-8 by default

"mbcs" is not a valid charset name; it’s an internal charset saying that Python will use the function MultiByteToWideChar() to decode bytes to unicode. This function uses the current codepage to decode byte strings.

You can read the charset using sys.getfilesystemencoding(). The function may return None if Python is unable to determine the default encoding.

PyUnicode_DecodeFSDefaultAndSize() uses the default file system encoding, or UTF-8 if it is not set.

On UNIX (and other operating systems), it’s possible to mount different file systems using different charsets. sys.getfilesystemencoding() will be the same for the different file systems, since this encoding is only used between Python and the kernel, not between the kernel and the file system, which may use a different charset.

Display a filename

Example of a function formatting a filename to display it to human eyes: ::

from sys import getfilesystemencoding
def format_filename(filename):
    return str(filename, getfilesystemencoding(), 'replace')

Example: format_filename(b'r\xffport.doc') gives 'r�port.doc' with the UTF-8 encoding.

Functions producing filenames

Policy: for unicode arguments: drop invalid bytes filenames; for bytes arguments: return bytes

  • os.listdir()
  • glob.glob()

This behaviour (silently dropping invalid filenames) is motivated by the fact that if a directory of 1000 files contains just one invalid filename, listdir() would otherwise fail for the whole directory. Or, if your directory contains 1000 Python scripts (.py) and just one other document with an invalid filename (eg. r�port.doc), glob.glob('*.py') would fail even though all the .py scripts have valid filenames.
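
A sketch of what this policy means in practice (under the draft’s proposed behaviour, not necessarily what your Python build does):

import os

print(os.listdir('.'))   # str argument: unicode results; undecodable names are dropped
print(os.listdir(b'.'))  # bytes argument: every entry is reported as bytes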

Policy: for a unicode argument: raise a UnicodeDecodeError on an invalid filename; for a bytes argument: return bytes

  • os.readlink()

Policy: create unicode directory or raise an UnicodeDecodeError

  • os.getcwd()

Policy: always returns bytes

  • os.getcwdb()

Functions for filename manipulation

Policy: raise TypeError on bytes/str mix

  • os.path.*(), eg. os.path.join()
  • fnmatch.*()

Functions accessing files

Policy: accept both bytes and str

  • io.open()
  • os.open()
  • os.chdir()
  • os.stat(), os.lstat()
  • os.rename()
  • os.unlink()
  • shutil.*()

os.rename(), shutil.copy*() and shutil.move() allow using bytes for one argument and unicode for the other

bytearray

In most cases, bytearray() can be used as bytes for a filename.

Unicode normalisation

Unicode characters can be normalized into 4 forms: NFC, NFD, NFKC or NFKD. Python never normalizes strings (nor filenames). No operating system normalizes filenames. So users who use different normalization forms would be unable to retrieve their files. Don’t panic! All users use the same norm.

Use unicodedata.normalize() to normalize a unicode string.
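
For illustration, here is a short example with é, which has both a precomposed and a decomposed form:

import unicodedata

composed = '\xe9'       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = 'e\u0301'  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False: different code points
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
print(unicodedata.normalize('NFD', composed) == decomposed)  # True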
