Learning AlphaGo with Tic-Tac-Toe: tensorflow-2 Edition

1/28/2023

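1. Overview

I am studying tic-tac-toe as part of learning how AlphaGo works. Up to the previous post I used Tensorflow-1.15. From this post on, the execution environment is Tensorflow-2.9, and I run the same experiment as with tensorflow-1.15. To allow a comparison between tensorflow-1.x and tensorflow-2.x, I keep the tensorflow-1.15 settings and code unchanged wherever possible and make only the minimum changes needed to run them.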

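2. Detailed study

(a) Overview

The tic-tac-toe board is treated as a 3×3 image, and the approach used for handwritten character recognition is applied to it. The environment is tensorflow-2.9. As input data, from all move sequences (9! = 362880) enumerated for the minimax method, I use the board state at the moment each game is decided, together with its result (win, loss, or draw).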
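As a rough illustration of where the input data comes from, the following sketch enumerates every tic-tac-toe game and counts the finished games by result. 9! is the number of orderings if all nine cells were always filled; because a game stops as soon as one side completes a line, the number of finished games is smaller, 255,168, which matches the record count given for the training data. The board encoding (1 for the first player, -1 for the second, 0 for empty) and the function names are assumptions made for this sketch, not the code from the earlier posts.

```python
# All eight winning lines on a board whose cells are indexed 0..8.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 1 or -1 if that player has completed a line, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


def count_games(board, player, totals):
    """Play every legal move recursively and tally finished games."""
    for cell in range(9):
        if board[cell] != 0:
            continue
        board[cell] = player
        won = winner(board)
        if won != 0:
            totals[won] += 1          # game decided by a completed line
        elif all(v != 0 for v in board):
            totals[0] += 1            # board full without a line: draw
        else:
            count_games(board, -player, totals)
        board[cell] = 0               # undo the move before trying the next


totals = {1: 0, -1: 0, 0: 0}
count_games([0] * 9, 1, totals)
print(totals, sum(totals.values()))
```

Each finished game contributes one (board, result) pair, so the same traversal could be used to collect the training records instead of only counting them.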

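The data is converted into a format tensorflow can use, a model is trained on it, and the model is then used to play tic-tac-toe. Because the source information comes from the minimax analysis, the best possible outcome is for the tensorflow model to reach the level of minimax play. The rough procedure is as follows.

(1) Create the training input data from the minimax analysis

(2) Build a model with tensorflow using the training input data

(3) Play actual games using the tensorflow model

Step (1), creating the training input data, is unchanged, so this post covers (2) and (3) in two parts.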

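(b) Details

(2) Build a model with tensorflow using the training input data

Create dl2tensorflow.py. It runs in the tensorflow-2.9 environment.

The training data is the same as for tensorflow-1.15: r1_data.npy (board data) and r2_data.npy (result data), 255,168 records in total. Because the results are already known, 100% of the data is used for training, and the model is created and saved. The finished model is saved as dl2model.h5.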
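To make the data format more concrete, here is a minimal sketch of how one finished game might be turned into a 9-element input vector and a 3-element one-hot label. The cell values (1, -1, 0) and the helper names are assumptions for illustration only; the column order (first player wins, second player wins, draw) is inferred from how the game script later indexes the prediction, and the actual contents of the .npy files may differ.

```python
import numpy as np


def encode_board(board):
    """Flatten a 3x3 board (given as 9 cell values) into a float32 vector."""
    return np.asarray(board, dtype=np.float32)


def encode_result(result):
    """One-hot encode a result: 1 = first player wins, -1 = second, 0 = draw."""
    column = {1: 0, -1: 1, 0: 2}[result]   # assumed column order
    one_hot = np.zeros(3, dtype=np.float32)
    one_hot[column] = 1.0
    return one_hot


# Two hypothetical finished games: a first-player win and a draw.
games = [
    ([1, 1, 1, -1, -1, 0, 0, 0, 0], 1),
    ([1, -1, 1, 1, -1, -1, -1, 1, 1], 0),
]
images = np.stack([encode_board(b) for b, _ in games])
labels = np.stack([encode_result(r) for _, r in games])
print(images.shape, labels.shape)
```

Arrays shaped like this line up with a Dense layer whose input_shape is (9,) and with a 3-unit softmax output trained with categorical cross-entropy.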

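With tensorflow-1.15, epochs was set to 100. With that setting, however, training does not converge the way it did under tensorflow-1.15. Setting epochs to 600 or more gives convergence comparable to Tensorflow-1.15. I also changed the plotted parameter to loss; changing the tensorflow-1.15 code to plot loss as well makes the two runs comparable.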

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Load the board data (images) and the result data (labels).
images = np.load('dl1_data.npy')
labels = np.load('dl2_data.npy')
print(images.shape, labels.shape)

# Split off the last 25% of the records; they are used for the evaluation at the end.
count = int(images.shape[0] * 0.75)
train_images, test_images = np.split(images, [count])
print(train_images.shape, test_images.shape)
train_labels, test_labels = np.split(labels, [count])
print(train_labels.shape, test_labels.shape)

# 9 board cells in, 3 class probabilities out.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='sigmoid', input_shape=(9,)),
    tf.keras.layers.Dense(32, activation='sigmoid'),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(3, activation='softmax')
])

# The output layer already applies softmax, so the loss receives
# probabilities rather than logits (from_logits=False).
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=False)

model.compile(optimizer='SGD',
              loss=loss_fn,
              metrics=['accuracy'])

# Train on the full data set; Keras holds out the last 20% for validation.
history = model.fit(images, labels, batch_size=500,
                    epochs=600, validation_split=0.2)

# Plot training accuracy and loss per epoch.
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['loss'], label='loss')
plt.ylabel('accuracy / loss')
plt.xlabel('epoch')
plt.legend(loc='best')
plt.show()

# Save the model, then reload it to check that the saved file works.
model.save('dl2model.h5')
model = tf.keras.models.load_model("dl2model.h5")

test_loss, test_acc = model.evaluate(test_images, test_labels)
print('loss: {:.3f}\nacc: {:.3f}'.format(test_loss, test_acc))
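The loss plotted by the script is categorical cross-entropy computed on the softmax output of the last layer. As a rough sketch of what that quantity is, the following NumPy code computes softmax and cross-entropy by hand for a single example; the numbers and function names are made up for illustration and this is not how Keras implements the loss internally.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def categorical_crossentropy(one_hot, probs, eps=1e-7):
    """Cross-entropy between one-hot labels and predicted probabilities."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.sum(one_hot * np.log(probs), axis=-1)


# One hypothetical example: raw scores for (first wins, second wins, draw).
logits = np.array([[2.0, 0.5, -1.0]])
labels = np.array([[1.0, 0.0, 0.0]])

probs = softmax(logits)
loss = categorical_crossentropy(labels, probs)
print(probs, loss)
```

With a one-hot label the loss reduces to the negative log of the probability assigned to the correct class, so it approaches 0 as that probability approaches 1. The from_logits argument of the Keras loss only states whether the values passed in are raw scores (to which softmax still has to be applied) or probabilities like probs above.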

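(3) Play actual games using the tensorflow model

Create ttttensorflow2.py. It runs in the tensorflow-2.9 environment. The tictactoe.py used here is the montecarlo version.

The script loads the trained model (dl2model.h5). Here too, is_reach(), which was used for the alphabeta method, is used. My impression is that it plays at about the same level as the tensorflow-1.15 version.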

from tictactoe import Tictactoe
import random
import tensorflow as tf
import numpy as np

def random_select(actions):
    # Pick one of the available actions uniformly at random.
    index = random.randint(0, len(actions) - 1)
    return actions[index]

def input_select(actions):
    # Ask the user until a legal action is entered.
    while True:
        print(actions)
        action = int(input('select actions='))
        if action in actions:
            break
        else:
            print('input again')
    return action

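def tensorflow_select(actions):
    # The model is reloaded on every call.
    model = tf.keras.models.load_model("dl2model.h5")
    # An odd number of remaining cells means the first player is to move.
    if (len(actions) % 2) == 1:
        flg = 1
    else:
        flg = 2
    result = []
    for action in actions:
        # If is_reach() returns an action, play it immediately.
        reach = obj.is_reach()
        if reach is not None:
            print("reach action ", reach)
            return reach
        # Try the action, predict the outcome of the resulting board, then undo.
        obj.do_game(action)
        f1 = [obj.fields]
        a1 = np.array(f1)
        a2 = a1.astype(np.float32)
        predictions = model.predict(a2)
        l1 = predictions.tolist()
        l1[0].append(action)  # [p0, p1, p2, action]
        result.append(l1[0])
        obj.undo_game(action)
    # Choose the action with the highest predicted value for the side to move.
    maxvalue = -1
    maxaction = None
    for item in result:
        value = item[flg-1]
        if value > maxvalue:
            maxvalue = value
            maxaction = item[3]
    return maxaction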

def montecarlo_select(actions):
    # An odd number of remaining cells means the first player is to move.
    if (len(actions) % 2) == 1:
        flg = 1
    else:
        flg = 2
    result = []
    for action in actions:
        # If is_reach() returns an action, play it immediately.
        reach = obj.is_reach()
        if reach is not None:
            print("reach action ", reach)
            return reach
        # Try the action and count the outcomes of all continuations.
        obj.do_game(action)
        init = [action,0,0,0]
        minimax(obj.next_action(), init)
        result.append(init)
        obj.undo_game(action)
    print(result)
    # Collect every action that shares the best count for the side to move.
    maxvalue = -1
    maxaction = None
    maxlist = []
    for item in result:
        value = item[flg]
        if value > maxvalue:
            maxvalue = value
            maxaction = item[0]
            maxlist = [item[0]]
        elif value == maxvalue:
            maxlist.append(item[0])
    print('maxlist ', maxlist)
    # Break ties at random.
    if len(maxlist) != 1:
        maxaction = random.choice(maxlist)
    print('maxaction ', maxaction)
    return maxaction

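def alphabeta_select(actions):
    # An odd number of remaining cells means the first player is to move.
    if (len(actions) % 2) == 1:
        flg = 1
    else:
        flg = 2
    result = []
    for action in actions:
        # If is_reach() returns an action, play it immediately.
        reach = obj.is_reach()
        if reach is not None:
            print("reach action ", reach)
            return reach
        # Try the action and count the outcomes of all continuations.
        obj.do_game(action)
        init = [action,0,0,0]
        minimax(obj.next_action(), init)
        result.append(init)
        obj.undo_game(action)
    print(result)
    # Choose the first action with the best count for the side to move.
    maxvalue = -1
    maxaction = None
    for item in result:
        value = item[flg]
        if value > maxvalue:
            maxvalue = value
            maxaction = item[0]
    return maxaction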

def minimax_select(actions):
    # An odd number of remaining cells means the first player is to move.
    if (len(actions) % 2) == 1:
        flg = 1
    else:
        flg = 2
    result = []
    for action in actions:
        # Try the action and count the outcomes of all continuations.
        obj.do_game(action)
        init = [action,0,0,0]
        minimax(obj.next_action(), init)
        result.append(init)
        obj.undo_game(action)
    print(result)
    # Choose the first action with the best count for the side to move.
    maxvalue = -1
    maxaction = None
    for item in result:
        value = item[flg]
        if value > maxvalue:
            maxvalue = value
            maxaction = item[0]
    return maxaction

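def minimax(actions, result):
    # Walk every continuation and tally the outcomes in result:
    # result[1] counts score 1, result[2] counts score -1, result[3] counts draws.
    for action in actions:
        score = obj.do_game(action)
        if score == 1:
            result[1] += 1
        elif score == -1:
            result[2] += 1
        elif score == 0:
            result[3] += 1
        else:
            # Game not finished yet: go one move deeper.
            minimax(obj.next_action(), result)
        obj.undo_game(action)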

if __name__ == "__main__":
    obj = Tictactoe()
    actions = [0,1,2,3,4,5,6,7,8]
    for _ in range(9):
        # The tensorflow model plays one side, the montecarlo selection the other.
        if obj.myturn:
            print('my turn')
            action = tensorflow_select(actions)
        else:
            print('other turn')
            action = montecarlo_select(actions)
        print(actions)
        print("select", action)
        result = obj.do_game(action)
        print(obj.game_state())
        if result == 1:
            print("o Win")
            break
        if result == -1:
            print("x Win")
            break
        if result == 0:
            print("Draw")
            break
        actions = obj.next_action()
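The selection step in tensorflow_select() amounts to scoring the board that results from each candidate action and taking the action whose predicted probability for the side to move is highest. The sketch below shows that idea in isolation, predicting all candidate boards in one batch rather than one at a time. The names select_by_prediction and fake_predict are hypothetical; fake_predict is only a stand-in for model.predict so the example runs without the trained model, and its scoring rule is made up.

```python
import numpy as np


def select_by_prediction(boards_after_move, actions, player_column, predict):
    """Return the action whose resulting board scores highest for the player.

    boards_after_move[i] is the 9-cell board after playing actions[i];
    player_column is 0 for the first player and 1 for the second.
    """
    batch = np.asarray(boards_after_move, dtype=np.float32)
    probs = np.asarray(predict(batch))          # shape: (len(actions), 3)
    best = int(np.argmax(probs[:, player_column]))
    return actions[best]


def fake_predict(batch):
    """Stand-in for model.predict: favors boards with a larger cell sum."""
    first = 1.0 / (1.0 + np.exp(-batch.sum(axis=1)))
    second = 1.0 - first
    draw = np.zeros_like(first)
    return np.stack([first, second, draw], axis=1)


# Three hypothetical candidate actions and the boards they lead to.
actions = [4, 7, 2]
boards = [
    [0, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, -1, 0],
    [0, 0, -1, 0, 0, 0, 0, 0, 0],
]
print(select_by_prediction(boards, actions, 0, fake_predict))
print(select_by_prediction(boards, actions, 1, fake_predict))
```

np.argmax returns the first index on ties, which matches the strict `>` comparison in the original loop. Passing model.predict as the predict argument would issue a single prediction call per turn instead of one per candidate action.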

4. Remarks

My impression is that there is no large difference between Tensorflow-1 and Tensorflow-2.

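References

[Related posts on this blog]

・Writing fibonacci code with test-driven development
・Learning AlphaGo with Tic-Tac-Toe: building the learning logic
・Learning AlphaGo with Tic-Tac-Toe: tensorflow-1 Edition

Reference book

AlphaZero 深層学習・強化学習・探索 人工知能プログラミング実践入門 (AlphaZero: Deep Learning, Reinforcement Learning, and Search, a Practical Introduction to AI Programming)

by 布留川英一 (Hidekazu Furukawa)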
