I read a txt file with pandas:

data = pd.read_table(os.path.join(project_path, 'src/data/corpus.txt'), sep='\n')

The following error occurred:
'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

The cause of this error is:

you cannot randomly partition the bytes you've received and then ask UTF-8 to decode it. UTF-8 is a multibyte encoding, meaning you can have anywhere from 1 to 6 bytes to represent one character. If you chop that in half, and ask Python to decode it, it will throw you the unexpected end of data error.

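In other words, UTF-8 is a multibyte encoding in which a single character takes 1 to 6 bytes (1 to 4 under the current standard, RFC 3629), so a byte stream cannot be cut at an arbitrary position and then handed to Python for decoding.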
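The effect can be reproduced without pandas. The snippet below is a minimal sketch using only the standard library; the sample character and the cut position are chosen purely for illustration and are not taken from the original corpus file.

```python
# Encode one Chinese character as UTF-8; it occupies 3 bytes.
raw = "编".encode("utf-8")

# Keep only the first 2 bytes, i.e. cut the character in the middle.
truncated = raw[:2]

try:
    truncated.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # The decoder reached the end of the input while still inside a character.
    error = exc

print(error)
```

The exception reports that the data ended before the multibyte sequence was complete, which is the same kind of failure as in the pandas call.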

Solutions:

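  • If the error comes from Chinese characters in the file content, pass encoding='utf-8'.
  • If the file path contains Chinese characters, make sure the path consists only of English (ASCII) characters.
  • If neither applies, follow this GitHub discussion: https://github.com/pandas-dev/pandas/issues/43540 and add the encoding_errors parameter:

data = pd.read_table(os.path.join(project_path, 'src/data/corpus.txt'), sep='\n', encoding_errors='ignore')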
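To see what 'ignore' actually does to the data, here is a small standard-library sketch. It assumes that encoding_errors (available since pandas 1.3) takes the same values as the errors argument of Python's own decoders, as the pandas documentation describes; the sample bytes are made up for illustration.

```python
# Bytes for "数据" with the last byte missing, as if the file had been cut off.
raw = "数据".encode("utf-8")[:-1]

# errors="ignore" silently drops the bytes that cannot be decoded.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD, so the damaged spot stays visible.
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

Since 'ignore' discards undecodable bytes without any trace, 'replace' may be the safer choice when you need to know afterwards that part of the text was lost.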