TONT 40093: Some files come up strange in Notepad

The “联通” battery.

Original link: https://blogs.msdn.microsoft.com/oldnewthing/20040324-00/?p=40093

David Cumps discovered that certain text files come up strange in Notepad.

The reason is that Notepad has to edit files in a variety of encodings, and when its back is against the wall, sometimes it’s forced to guess.

Here’s the file “Hello” in various encodings:

48 65 6C 6C 6F

This is the traditional ANSI encoding.

48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (little-endian) encoding with no BOM.

FF FE 48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (little-endian) encoding with BOM. The BOM (FF FE) serves two purposes: First, it tags the file as a Unicode document, and second, the order in which the two bytes appear indicates that the file is little-endian.

00 48 00 65 00 6C 00 6C 00 6F

This is the Unicode (big-endian) encoding with no BOM. Notepad does not support this encoding.

FE FF 00 48 00 65 00 6C 00 6C 00 6F

This is the Unicode (big-endian) encoding with BOM. Notice that this BOM is in the opposite order from the little-endian BOM.

EF BB BF 48 65 6C 6C 6F

This is UTF-8 encoding. The first three bytes are the UTF-8 encoding of the BOM.

2B 2F 76 38 2D 48 65 6C 6C 6F

This is UTF-7 encoding. The first five bytes are the UTF-7 encoding of the BOM. Notepad doesn’t support this encoding.

Notice that the UTF-7 BOM encoding is just the ASCII string “+/v8-”, which is difficult to distinguish from a regular file that happens to begin with those five characters (as odd as they may be).
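The byte dumps above can be reproduced with a short Python sketch. This is only an illustration using Python's standard codecs, not anything Notepad itself does; the `hex_dump` helper and the `samples` table are names made up for this example, and the UTF-7 line assumes the BOM is written as an ordinary U+FEFF character at the start of the text.

```python
import codecs

TEXT = "Hello"

def hex_dump(data: bytes) -> str:
    """Format bytes the same way as the dumps in the article."""
    return " ".join(f"{b:02X}" for b in data)

# Hypothetical lookup table: label -> encoded bytes of "Hello".
samples = {
    "ANSI": TEXT.encode("ascii"),
    "Unicode (little-endian), no BOM": TEXT.encode("utf-16-le"),
    "Unicode (little-endian), BOM": codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le"),
    "Unicode (big-endian), no BOM": TEXT.encode("utf-16-be"),
    "Unicode (big-endian), BOM": codecs.BOM_UTF16_BE + TEXT.encode("utf-16-be"),
    # "utf-8-sig" prepends the UTF-8 encoding of the BOM.
    "UTF-8": TEXT.encode("utf-8-sig"),
    # Assumption: the UTF-7 BOM is U+FEFF encoded like any other character.
    "UTF-7": ("\ufeff" + TEXT).encode("utf-7"),
}

for label, data in samples.items():
    print(f"{label:34} {hex_dump(data)}")
```

Note that the UTF-7 row consists entirely of printable ASCII bytes, which is why its BOM is so hard to tell apart from ordinary text.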

The encodings that lack a special prefix and are still supported by Notepad are the traditional ANSI encoding (i.e., “plain ASCII”) and the Unicode (little-endian) encoding with no BOM. When faced with a file that lacks a special prefix, Notepad is forced to guess which of those two encodings the file actually uses. The function that does this work is IsTextUnicode, which studies a chunk of bytes and does some statistical analysis to come up with a guess.
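The overall decision (check for a prefix first, guess only when there is none) might be sketched roughly as follows. This is a minimal Python illustration, not Notepad's code: `sniff_encoding` and `looks_like_utf16le` are hypothetical names, and the zero-byte count is a deliberately crude stand-in for the statistical analysis that IsTextUnicode performs, whose actual tests are not reproduced here.

```python
import codecs

# BOM prefixes for the prefixed encodings the article says Notepad supports.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "Unicode (little-endian)"),
    (codecs.BOM_UTF16_BE, "Unicode (big-endian)"),
]

def looks_like_utf16le(data: bytes) -> bool:
    """Hypothetical heuristic, NOT the real IsTextUnicode logic.

    Guess UTF-16LE when the length is even and most of the
    high-order bytes (every second byte) are zero.
    """
    if len(data) < 2 or len(data) % 2 != 0:
        return False
    high_bytes = data[1::2]
    return high_bytes.count(0) * 2 > len(high_bytes)

def sniff_encoding(data: bytes) -> str:
    # Step 1: a recognized prefix settles the question.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Step 2: no prefix, so choose between the two remaining candidates.
    if looks_like_utf16le(data):
        return "Unicode (little-endian), guessed"
    return "ANSI, guessed"

print(sniff_encoding(bytes.fromhex("48 65 6C 6C 6F")))
print(sniff_encoding(bytes.fromhex("48 00 65 00 6C 00 6C 00 6F 00")))
print(sniff_encoding(bytes.fromhex("FF FE 48 00 65 00 6C 00 6C 00 6F 00")))
```

With this toy heuristic, the big-endian file with no BOM falls through to the ANSI guess, which matches the article's point that only two prefix-less encodings are in play.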

And as the documentation notes, “Absolute certainty is not guaranteed.” Short strings are most likely to be misdetected.
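One way to see why short inputs are ambiguous: a handful of ASCII bytes with an even length is also perfectly valid little-endian Unicode, just for different characters. The Python sketch below (an illustration only; the variable names are made up, and the four-byte input is an arbitrary example) decodes the same bytes both ways, and neither decode raises an error.

```python
# Four ASCII bytes: 48 65 6C 6C
data = "Hell".encode("ascii")

# Interpretation 1: one byte per character (ANSI / ASCII).
as_ansi = data.decode("ascii")

# Interpretation 2: two bytes per character, low byte first (UTF-16LE).
as_utf16le = data.decode("utf-16-le")

code_points = [f"U+{ord(ch):04X}" for ch in as_utf16le]

print(repr(as_ansi))   # four Latin letters
print(code_points)     # two code points in the CJK ideograph range
```

With so few bytes there is little evidence to favor one reading over the other, so any statistical guess has little to go on.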

Comments

石樱灯笼 said:

February 14, 2019, 17:09

What is the “联通 battery”?
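
mmiaow said:

February 14, 2019, 19:52

It is an encoding-detection bug in Notepad in older versions of Windows (around Win2K). Steps to reproduce:
1. Create a new text document in Notepad and type the two characters “联通”;
2. Save the file as an ANSI-encoded text file;
3. Close Notepad, then reopen the file you just saved.
At this point, because the first few bytes of the encoding of “联通” resemble a Unicode signature, Notepad tries to load the file as Unicode. What the user sees is a small black block, which users jokingly call “the burnt 联通 battery.”
The problem still exists in the new Notepad in Windows 10 1809, although it no longer shows up as a small black block. The same thing even happens when the file is opened in Notepad++.
For a more detailed explanation, see: https://www.cnblogs.com/candyboy/articles/1743033.html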

石樱灯笼 said:

February 21, 2019, 15:08

I have always used UTF-8 and never trust encodings that are not in international use.
