sklearn.feature_extraction.text.CountVectorizer (scikit-learn Chinese community)

class sklearn.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

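Convert a collection of text documents to a matrix of token counts.

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection, then the number of features will be equal to the vocabulary size found by analyzing the data.

Read more in the User Guide.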
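As a rough illustration of the two statements above (sparse output, and a feature count equal to the learned vocabulary size), here is a minimal sketch. The two-document corpus is made up for this example, and `issparse` comes from SciPy, not from this class.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# A tiny illustrative corpus (hypothetical data).
docs = ["the cat sat", "the dog sat down"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# The result is a SciPy sparse matrix in CSR format, one row per document.
print(issparse(X), X.format)

# With no fixed vocabulary and no feature selection, the number of
# columns equals the size of the vocabulary learned from the data.
n_docs, n_features = X.shape
print(n_docs, n_features, len(vectorizer.vocabulary_))
```

Calling `X.toarray()` gives a dense view, which is convenient for small corpora but may be costly for large ones.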

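input string {'filename', 'file', 'content'}, default='content'
If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.

If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory.

Otherwise, the input is expected to be a sequence of items that can be of type string or byte.

encoding string, default='utf-8'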
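If bytes or files are given to analyze, this encoding is used to decode.

decode_error {'strict', 'ignore', 'replace'}, default='strict'
Instruction on what to do if a byte sequence given to analyze contains characters that are not of the given encoding. By default it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'.

strip_accents {'ascii', 'unicode'}, default=None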
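Remove accents and perform other character normalization during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any characters. None (default) does nothing.

Both 'ascii' and 'unicode' use NFKD normalization from unicodedata.normalize.

lowercase bool, default=True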
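Convert all characters to lowercase before tokenizing.

preprocessor callable, default=None
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable.

tokenizer callable, default=None
Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.

stop_words string {'english'}, list, default=None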
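If 'english', a built-in stop word list for English is used. There are several known issues with 'english' and you should consider an alternative (see Using stop words).

If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on the intra-corpus document frequency of terms.

token_pattern string, default='(?u)\b\w\w+\b'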
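Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

ngram_range tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for the word n-grams or character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.

analyzer string, {'word', 'char', 'char_wb'} or callable, default='word'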
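Whether the feature should be made of word n-grams or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

If a callable is passed, it is used to extract the sequence of features out of the raw, unprocessed input.

Changed in version 0.21: since v0.21, if input is 'filename' or 'file', the data is first read from the file and then passed to the given callable analyzer.

max_df float in range [0.0, 1.0] or int, default=1.0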
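When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.

min_df float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.

max_features int, default=None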
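If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.

This parameter is ignored if vocabulary is not None.

vocabulary Mapping or iterable, default=None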
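Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.

binary bool, default=False
If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

dtype type, default=np.int64
Type of the matrix returned by fit_transform() or transform().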
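Several of the parameters above interact with how the vocabulary is built, so a small sketch may help. The four-document corpus and the `feature_names` helper are hypothetical and exist only for this illustration; the helper reads terms from the fitted `vocabulary_` mapping so the sketch does not depend on a particular feature-name accessor.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-document corpus used only for illustration.
docs = [
    "apple banana apple",
    "apple cherry",
    "apple banana",
    "apple durian",
]

def feature_names(vectorizer):
    # Terms ordered by column index, read from the fitted vocabulary_.
    return sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# min_df=2 is an int, so it is an absolute count: terms that occur in
# fewer than 2 documents ('cherry', 'durian') are dropped.
rare_removed = CountVectorizer(min_df=2).fit(docs)
print(feature_names(rare_removed))

# max_df=0.75 is a float, so it is a proportion: terms that occur in
# more than 75% of the documents ('apple') are dropped.
common_removed = CountVectorizer(max_df=0.75).fit(docs)
print(feature_names(common_removed))

# binary=True records presence or absence instead of counts.
binary_counts = CountVectorizer(binary=True).fit_transform(docs)
print(binary_counts.toarray()[0])

# A fixed vocabulary pins the columns in the given order; min_df,
# max_df and max_features are then ignored.
fixed = CountVectorizer(vocabulary=["banana", "apple"])
fixed_counts = fixed.transform(docs).toarray()
print(fixed_counts)
```

Passing the vocabulary as a list assigns column indices in list order, so `banana` is column 0 and `apple` is column 1 in the last matrix.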

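When pickling, the stop_words_ attribute can get large and increase the model size. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.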
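The note above can be sketched as follows, here using the "set to None" option. The corpus is hypothetical, and the sketch assumes the vectorizer is pickled with the standard library `pickle` module; other serializers may behave differently.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus; min_df=2 discards the rare terms 'blue' and
# 'yellow' from the vocabulary.
docs = ["red green blue", "red green", "red yellow"]
vectorizer = CountVectorizer(min_df=2).fit(docs)

# Clear the introspection-only attribute before pickling
# (delattr(vectorizer, "stop_words_") is the other option the note names).
vectorizer.stop_words_ = None

payload = pickle.dumps(vectorizer)
restored = pickle.loads(payload)

# The restored vectorizer still transforms text with the learned vocabulary.
restored_terms = sorted(restored.vocabulary_)
restored_counts = restored.transform(["red green red"]).toarray()
print(restored_terms)
print(restored_counts)
```

Transforming new text only needs `vocabulary_`, which is why clearing `stop_words_` does not affect the restored object.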

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
>>> vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
>>> X2 = vectorizer2.fit_transform(corpus)
>>> print(vectorizer2.get_feature_names())
['and this', 'document is', 'first document', 'is the', 'is this',
 'second document', 'the first', 'the second', 'the third', 'third one',
 'this document', 'this is', 'this the']
>>> print(X2.toarray())
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]
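To complement the unigram and bigram examples above, the following sketch shows a mixed ngram_range and the 'char_wb' analyzer. The input strings and the `feature_names` helper are hypothetical; the helper lists terms in column order from the fitted `vocabulary_`.

```python
from sklearn.feature_extraction.text import CountVectorizer

def feature_names(vectorizer):
    # Terms ordered by column index, read from the fitted vocabulary_.
    return sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# ngram_range=(1, 2) extracts both unigrams and bigrams.
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(["good morning world"])
print(feature_names(uni_bi))

# analyzer='char_wb' builds character n-grams inside word boundaries,
# padding each word with a space on both sides.
char_wb = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit(["hi you"])
print(feature_names(char_wb))
```

With 'char_wb', no trigram spans the gap between "hi" and "you"; each word contributes only its own space-padded n-grams.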

__init__(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

Initialize self. See help(type(self)) for accurate signature.

build_analyzer()

Return a callable that handles preprocessing, tokenization and n-grams generation.
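A minimal sketch of how the returned callable might be used is shown below. The sample sentence is made up for illustration, and the sketch assumes the default preprocessing and token_pattern described in the parameter list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() does not require fitting; it reflects the
# constructor settings of the vectorizer.
analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

# The callable lowercases the text, tokenizes it with the default
# token_pattern (single-character tokens such as "a" are dropped),
# then emits the unigrams followed by the bigrams.
tokens = analyze("This is a Text document.")
print(tokens)
```

Inspecting the analyzer output this way can be a quick check of what a given combination of lowercase, token_pattern and ngram_range will count before fitting on a full corpus.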