[Kaggle Competition] Iteratively Training the Model
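In computer vision, once the data is prepared and the model is designed and defined, we can start training the model iteratively. By tuning the hyperparameters (which takes both theory and plenty of experience), we try to bring the two most commonly used metrics, loss and accuracy, to their best values. After training finishes, the loss curve and the accuracy curve are usually used to assess the whole training process.
Before training, the data has to be split into a training set and a validation set: the model is trained on the training set and evaluated on the validation set. Once the best parameters have been found, the model is tested one final time on the test set, and the predictions are saved to a CSV file and submitted to Kaggle to see how they score and to guide later corrections.
There are three common ways to split the dataset:
- simple hold-out validation;
- K-fold cross-validation;
- repeated K-fold validation with shuffling.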
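The first two schemes can be sketched with NumPy alone. This is only an illustration: the function names `holdout_split` and `kfold_indices`, the 20% validation fraction, and the fixed seed are assumptions, not part of the original project.

```python
import numpy as np


def holdout_split(n_samples, val_fraction=0.2, seed=0):
    """Shuffle the sample indices once and hold out a fraction for validation."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(n_samples)
    n_val = int(n_samples * val_fraction)
    # The first n_val shuffled indices form the validation set
    return indices[n_val:], indices[:n_val]


def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


# 25000 is the number of training images reported in the log (12500 cats + 12500 dogs)
train_idx, val_idx = holdout_split(25000)
print(len(train_idx), len(val_idx))
for fold, (tr, va) in enumerate(kfold_indices(25000, k=5)):
    print(fold, len(tr), len(va))
```

Repeated K-fold validation with shuffling then amounts to calling `kfold_indices` several times with different seeds and averaging the validation scores over all runs.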
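With these methods and caveats in mind, we can start writing the TensorFlow program that trains the model iteratively and saves the final model. That first requires learning about TensorFlow model persistence (that is, how to save and restore a model).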
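TensorFlow Model Persistence
This part shows how to write a TensorFlow program that persists a trained model, and how to restore the saved model from the persisted files. TensorFlow (the 1.x API used throughout this post) provides the tf.train.Saver class for saving and restoring a neural network model.
Saving the Model
The following program is an example of saving a model: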
import tensorflow as tf

# Path where the model is saved
model_path = 'C:/Users/Administrator/logs/model.ckpt'
# Declare two variables and compute their sum
v1 = tf.Variable(tf.constant(1.0, shape=[1]), name="v1")
v2 = tf.Variable(tf.constant(3.0, shape=[1]), name="v2")
result = v1 + v2
# Create a tf.train.Saver object used to save the model
saver = tf.train.Saver()
with tf.Session() as sess:
    # Initialize all variables
    sess.run(tf.global_variables_initializer())
    # Save the model to the given path
    saver.save(sess, model_path)
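After running the program, four files appear in the save directory, because TensorFlow stores the structure of the computation graph separately from the parameter values:
- model.ckpt.meta stores the structure of the computation graph
- model.ckpt.data-00000-of-00001 stores the value of every variable in the graph
- checkpoint stores the list of model files in the directory, so the latest one can be picked up directly when restoring
- model.ckpt.index maps each variable name to its location in the data file; we do not need to touch it directly for now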
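Loading the Model
There are two common ways to load a model:
- define all the operations of the TensorFlow computation graph again in the loading program;
- do not redefine the operations, and load the persisted graph directly.
Example code for the first method: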
import tensorflow as tf

# Path where the model was saved
model_path = 'C:/Users/Administrator/logs/model.ckpt'
# Declare the variables and define the graph exactly as in the saving code
v1 = tf.Variable(tf.constant(1.0, shape=[1]), name="v1")
v2 = tf.Variable(tf.constant(3.0, shape=[1]), name="v2")
result = v1 + v2
saver = tf.train.Saver()
with tf.Session() as sess:
    # Restore the saved model and compute the sum from the restored variable values
    # (no variable initialization is needed, restore assigns the values)
    saver.restore(sess, model_path)
    print(sess.run(result))
Example code for the second method:
import tensorflow as tf

# Path where the model was saved
model_path = 'C:/Users/Administrator/logs/model.ckpt'
# Load the persisted graph structure from the .meta file
saver = tf.train.import_meta_graph(model_path + '.meta')
with tf.Session() as sess:
    saver.restore(sess, model_path)
    # "add:0" is the output tensor of the v1 + v2 operation in the saved graph
    print(sess.run(tf.get_default_graph().get_tensor_by_name("add:0")))
Both methods print the same result:
INFO:tensorflow:Restoring parameters from C:/Users/Administrator/logs/model.ckpt
[ 4.]
Implementing Iterative Training
The program is as follows:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import os
import time

# Import the model definition file and the data preparation file
import model
import input_data
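# --------------------------- Network hyperparameters -------------------------------------------
N_CLASSES = 2             # Number of output classes
IMG_W = 227               # Image width
IMG_H = 227               # Image height
IMG_C = 3                 # Number of image channels
BATCH_SIZE = 10           # Training batch size
MAX_STEP = 20000          # Maximum number of training steps
CAPACITY = 2000           # Capacity argument passed to input_data.get_batch (the input queue size)
LEARNING_RETE = 0.0001    # Learning rate

# Paths on the local machine: training set, and where the model and log files are saved
train_dir = "F:/Software/Python_Project/Classification-cat-dog/train/"
logs_train_dir = "F:/Software/Python_Project/Classification-cat-dog/logs/"
# Paths on the cloud server: training set, and where the model and log files are saved
# train_dir = '/data/Dogs-Cats-Redux-Kernels-Edition/train/'
# logs_train_dir = '/data/Dogs-Cats-Redux-Kernels-Edition/logs/'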
# --------------------------- Training function -------------------------------------------
def run_training():
    # Get the list of training file names and the matching list of labels
    file_list, label_list = input_data.get_files(train_dir)
    # Build the tensors that produce one batch of images and labels
    train_batch, train_label_batch = input_data.get_batch(file_list,
                                                          label_list,
                                                          IMG_W,
                                                          IMG_H,
                                                          BATCH_SIZE,
                                                          CAPACITY)
    regularizer = tf.contrib.layers.l2_regularizer(0.0001)
    # Network output for one training batch
    train_logits = model.inference(train_batch, True, BATCH_SIZE, regularizer, N_CLASSES)
    train_loss = model.losses(train_logits, train_label_batch)     # Loss on the training batch
    train_op = model.trainning(train_loss, LEARNING_RETE)          # Update the network weights from the loss and learning rate
    train_acc = model.evaluation(train_logits, train_label_batch)  # Accuracy on the training batch
    # Input/output placeholders for real samples passed in; labels are not one-hot encoded.
    # Note: the network above is built directly on train_batch, so these
    # placeholders are not part of the training graph.
    x_train = tf.placeholder(tf.float32, shape=[BATCH_SIZE, IMG_W, IMG_H, IMG_C], name='x_')
    y_train_ = tf.placeholder(tf.int32, shape=[BATCH_SIZE, ], name='y_')
    # # Model output for a training batch; logits is a batch_size*2 two-dimensional array
    # logits = model.inference(x,True,BATCH_SIZE,regularizer,N_CLASSES)
    # # (Small trick) multiply logits by 1 and name the result logits_eval, so the output tensor can later be fetched by name
    # b = tf.constant(value=1,dtype=tf.float32)
    # logits_eval = tf.multiply(logits,b,name='logits_eval')
    # # Cross entropy as the loss measuring the gap between predictions and ground truth
    # cross_entroy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y_)
    # # Mean cross entropy over all examples in the current batch
    # loss = tf.reduce_mean(cross_entroy,name='loss')
    # # Minimize the loss with tf.train.AdamOptimizer
    # train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)
    # # Accuracy of the model on one batch
    # correct_prediction = tf.equal(tf.cast(tf.argmax(logits,1),tf.int32), y_)
    # acc = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    # print(loss.shape,acc.shape)
    with tf.Session() as sess:
        tra_loss = []  # Loss value of every step, used for the plot below
        # Create the TensorFlow persistence (Saver) object
        saver = tf.train.Saver()
        sess.run(tf.global_variables_initializer())  # Initialize all variables
        coord = tf.train.Coordinator()  # Coordinator for the input threads
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)  # Start the queue runners
        summary_op = tf.summary.merge_all()  # Merge all summaries
        # Write the training summaries to logs_train_dir
        train_writer = tf.summary.FileWriter(logs_train_dir, sess.graph)
        try:  # Start training
            for step in np.arange(MAX_STEP):
                if coord.should_stop():
                    break
                image_batch, label_batch = sess.run([train_batch, train_label_batch])
                # Run one training step and fetch the loss and accuracy values
                _, loss, acc = sess.run([train_op, train_loss, train_acc],
                                        feed_dict={x_train: image_batch, y_train_: label_batch})
                # print(type(train_loss),type(train_acc),train_acc.shape)
                tra_loss.append(loss)
                # train_loss_val = np.sum(train_loss)
                # train_acc_val = np.sum(train_acc)
                if step % 100 == 0:  # Every 100 steps, print the loss and accuracy
                    # Format the fetched values (loss, acc), not the tensors train_loss / train_acc
                    print('Step %d,train loss = %.2f,train accuracy = %.2f%%' % (step, loss, acc * 100.0))
                    summary_str = sess.run(summary_op)
                    train_writer.add_summary(summary_str, step)
                if step % 2000 == 0 or (step + 1) == MAX_STEP:  # Save the model
                    checkpoint_path = os.path.join(logs_train_dir, 'model.ckpt')
                    saver.save(sess, checkpoint_path, global_step=step)
                    print('Model saved')
            print('Training finished!')
        except tf.errors.OutOfRangeError:  # The input queue ran out of data
            print('Done training -- epoch limit reached.')
        finally:
            # Stop all threads
            coord.request_stop()
            coord.join(threads)
    # Plot the loss curve from the recorded per-step values
    plt.plot(tra_loss)
    plt.xlabel('Iter')
    plt.ylabel('loss')
    plt.title('lr=%f,ti=%d,bs=%d' % (LEARNING_RETE, MAX_STEP, BATCH_SIZE))
    plt.tight_layout()
    plt.savefig('cat_and_dog_alexnet.jpg', dpi=200)
# ------------------------------- Program entry point ---------------------------------------
if __name__ == "__main__":
    run_training()
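To see which checkpoints this save condition produces, the schedule can be worked out without running any training. The sketch below reuses MAX_STEP and the 2000-step interval from the script; SAVE_EVERY and MAX_TO_KEEP are names introduced here for illustration. It relies on two facts about tf.train.Saver in TensorFlow 1.x: passing global_step appends "-<step>" to the checkpoint prefix, and the default max_to_keep is 5.

```python
MAX_STEP = 20000   # same value as in the training script
SAVE_EVERY = 2000  # interval used in the save condition (name is illustrative)
MAX_TO_KEEP = 5    # default of tf.train.Saver(max_to_keep=5)

# Steps for which `step % 2000 == 0 or (step + 1) == MAX_STEP` is true
saved_steps = [step for step in range(MAX_STEP)
               if step % SAVE_EVERY == 0 or (step + 1) == MAX_STEP]

# saver.save(..., global_step=step) appends "-<step>" to the checkpoint prefix
saved_names = ['model.ckpt-%d' % step for step in saved_steps]

# With the default max_to_keep, only the most recent checkpoints stay on disk
kept_names = saved_names[-MAX_TO_KEEP:]

print(len(saved_steps))
print(kept_names)
```

If earlier checkpoints matter (for example to pick the one with the best validation score), the Saver can be created with a larger max_to_keep.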
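Output
My laptop is slow, so the model had not finished training when I wrote this post. Only part of the output is shown here; the final output and an analysis of the loss and accuracy curves will be added tomorrow.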
There are 12500 cats
There are 12500 dogs
Step 0, train loss = 113810.02, train accuracy = 50%
Step 100, train loss = 20647.10, train accuracy = 40%
Step 200, train loss = 16054.08, train accuracy = 50%
Step 300, train loss = 7717.75, train accuracy = 50%
Step 400, train loss = 5881.07, train accuracy = 50%
Step 500, train loss = 2879.47, train accuracy = 70%
Step 600, train loss = 338.30, train accuracy = 70%
Step 700, train loss = 1178.86, train accuracy = 50%
Step 800, train loss = 287.65, train accuracy = 50%
Step 900, train loss = 245.80, train accuracy = 50%
Step 1000, train loss = 20.37, train accuracy = 50%
Step 1100, train loss = 49.53, train accuracy = 60%
Step 1200, train loss = 11.61, train accuracy = 60%
Step 1300, train loss = 1.78, train accuracy = 70%
Step 1400, train loss = 10.86, train accuracy = 30%
Step 1500, train loss = 2.33, train accuracy = 30%
Step 1600, train loss = 26.34, train accuracy = 40%
Step 1700, train loss = 43.71, train accuracy = 50%
Step 1800, train loss = 14.57, train accuracy = 60%
Step 1900, train loss = 23.90, train accuracy = 30%
Step 2000, train loss = 1.50, train accuracy = 50%
Step 2100, train loss = 3.84, train accuracy = 50%
Step 2200, train loss = 1.06, train accuracy = 60%
Step 2300, train loss = 1.90, train accuracy = 50%
Step 2400, train loss = 8.90, train accuracy = 50%
Step 2500, train loss = 4.88, train accuracy = 40%
Step 2600, train loss = 1.83, train accuracy = 70%
Step 2700, train loss = 3.73, train accuracy = 40%
Step 2800, train loss = 40.79, train accuracy = 40%
Step 2900, train loss = 57.23, train accuracy = 40%
Step 3000, train loss = 1.04, train accuracy = 80%
Step 3100, train loss = 1.16, train accuracy = 50%
Step 3200, train loss = 2.04, train accuracy = 50%
Step 3300, train loss = 49.13, train accuracy = 50%
Step 3400, train loss = 1.67, train accuracy = 70%
Step 3500, train loss = 2.48, train accuracy = 40%
Step 3600, train loss = 2.01, train accuracy = 50%
Step 3700, train loss = 2.04, train accuracy = 60%
Step 3800, train loss = 0.62, train accuracy = 60%
Step 3900, train loss = 3.46, train accuracy = 40%
Step 4000, train loss = 1.15, train accuracy = 50%
Step 4100, train loss = 2.64, train accuracy = 40%
Step 4200, train loss = 1.08, train accuracy = 60%
Step 4300, train loss = 5.22, train accuracy = 60%
Step 4400, train loss = 7.35, train accuracy = 50%
Step 4500, train loss = 0.60, train accuracy = 90%
Step 4600, train loss = 1.60, train accuracy = 80%
Step 4700, train loss = 1.02, train accuracy = 50%
Step 4800, train loss = 1.46, train accuracy = 60%
Step 4900, train loss = 1.33, train accuracy = 40%
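Until the curves are available, a log like this can be turned back into numbers with a regular expression. The sketch below is only an illustration: it parses five lines copied from the log above, and the regular expression and variable names are assumptions about the log format rather than part of the training script.

```python
import re

# A few lines copied from the training log
log_text = """\
Step 0, train loss = 113810.02, train accuracy = 50%
Step 1000, train loss = 20.37, train accuracy = 50%
Step 2000, train loss = 1.50, train accuracy = 50%
Step 3000, train loss = 1.04, train accuracy = 80%
Step 4500, train loss = 0.60, train accuracy = 90%
"""

# One match per log line: step, loss, accuracy in percent
pattern = re.compile(
    r"Step (\d+),\s*train loss = ([\d.]+),\s*train accuracy = ([\d.]+)%")

records = [(int(s), float(l), float(a)) for s, l, a in pattern.findall(log_text)]
steps = [r[0] for r in records]
losses = [r[1] for r in records]
accuracies = [r[2] for r in records]

mean_acc = sum(accuracies) / len(accuracies)
print(steps)
print(min(losses), mean_acc)
```

Because the loss drops by several orders of magnitude in this run, plotting the parsed values on a logarithmic axis, for example with matplotlib's `plt.semilogy(steps, losses)`, tends to be easier to read than a linear plot.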
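Caveats When Using Input File Queues
For feeding training data into the network, I have used two approaches before: shuffling and batching the data directly with numpy and feeding it through a placeholder, and using TensorFlow input file queues (tf.train.shuffle_batch) to feed Tensor data to the network. Both approaches work.
However, over the last two days I ran into a nasty pitfall in TensorFlow: if you process the input data with file queues, you have to pass the tensors produced by tf.train.batch directly into the network. You cannot go through placeholders, otherwise you get the following error:
TypeError: must be real number, not Tensor
You may also get this error:
InvalidArgumentError: You must feed a value for placeholder tensor 'x_' with dtype float and shape [10,227,227,3]
[[Node: x_ = Placeholder[dtype=DT_FLOAT, shape=[10,227,227,3], _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
As for the cause, I do not know yet and have not dug into it carefully, but it took me two days of stumbling to find this, and I have not seen anyone mention the problem before. The description above may not be very clear, so here is the code (key parts only):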
Correct code:
# Get the list of training file names and the matching list of labels
file_list, label_list = input_data.get_files(train_dir)
# Build the tensors that produce one batch of images and labels
train_batch, train_label_batch = input_data.get_batch(file_list,
                                                      label_list,
                                                      IMG_W,
                                                      IMG_H,
                                                      BATCH_SIZE,
                                                      CAPACITY)
regularizer = tf.contrib.layers.l2_regularizer(0.0001)
# Network output for one training batch (built directly on the queue output)
train_logits = model.inference(train_batch, True, BATCH_SIZE, regularizer, N_CLASSES)
train_loss = model.losses(train_logits, train_label_batch)     # Loss on the training batch
train_op = model.trainning(train_loss, LEARNING_RETE)          # Update the network weights from the loss and learning rate
train_acc = model.evaluation(train_logits, train_label_batch)  # Accuracy on the training batch
# Input/output placeholders for real samples passed in; labels are not one-hot encoded
x_train = tf.placeholder(tf.float32, shape=[BATCH_SIZE, IMG_W, IMG_H, IMG_C], name='x_')
y_train_ = tf.placeholder(tf.int32, shape=[BATCH_SIZE, ], name='y_')
Incorrect code:
# Get the list of training file names and the matching list of labels
file_list, label_list = input_data.get_files(train_dir)
# Build the tensors that produce one batch of images and labels
train_batch, train_label_batch = input_data.get_batch(file_list,
                                                      label_list,
                                                      IMG_W,
                                                      IMG_H,
                                                      BATCH_SIZE,
                                                      CAPACITY)
# Input/output placeholders for real samples passed in; labels are not one-hot encoded
x_train = tf.placeholder(tf.float32, shape=[BATCH_SIZE, IMG_W, IMG_H, IMG_C], name='x_')
y_train_ = tf.placeholder(tf.int32, shape=[BATCH_SIZE, ], name='y_')
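One possible source of the first error, independent of how the data is fed: a `%.2f` format only accepts real numbers, so passing a graph tensor (for example `train_loss`) instead of the value returned by `sess.run` raises this kind of TypeError. The sketch below uses a hypothetical `FakeTensor` class as a stand-in so that it runs without TensorFlow; `format_progress` is also a name made up for this illustration. Whether this is what happened in the failing run is an assumption.

```python
class FakeTensor(object):
    """Hypothetical stand-in for a graph tensor: it is not a Python number."""


def format_progress(step, loss, acc):
    # Same format string as the progress print in the training loop
    return 'Step %d,train loss = %.2f,train accuracy = %.2f%%' % (step, loss, acc * 100.0)


# Values fetched with sess.run are plain numbers, so formatting works
print(format_progress(100, 0.62, 0.6))

# Passing a tensor-like object to %.2f fails with a TypeError
try:
    format_progress(100, FakeTensor(), 0.6)
except TypeError as err:
    print('TypeError:', err)
```

The second error is what TensorFlow 1.x raises when a fetched operation depends on a placeholder that was not fed. That would happen, for instance, if the network were built on `x_train` and some `sess.run` call omitted `feed_dict`. Again, this is a guess about the failing code, which is not shown in full.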