Abstract As materials data sets grow in size and scope, data mining and statistical learning methods are becoming increasingly important for analyzing these data sets and building predictive models. This manuscript introduces matminer, an open-source, Python-based software platform that facilitates data-driven methods for analyzing and predicting materials properties. Matminer provides modules for retrieving large data sets from external databases such as the Materials Project, Citrination, Materials Data Facility, and Materials Platform for Data Science. It also implements an extensive library of feature extraction routines developed by the materials community, with 47 featurization classes that can generate thousands of individual descriptors and combine them into mathematical functions. Finally, matminer provides a visualization module for producing interactive, shareable plots. These functions are designed to integrate closely with the machine learning and data analysis packages already developed and used by the Python data science community. We explain the structure and logic of matminer, describe its various modules, and showcase several examples of how matminer can be used to collect data, reproduce data mining studies reported in the literature, and test new methodologies.