As I wait for my wife to return home so we can travel to visit our great-granddaughters for Easter, I am entertaining some deep thoughts about backpropagation in the neural net for the Connect 4 game. The biggest issue for me is formulating a continuous numerical measure of how good a win, or how bad a loss, is. That seems necessary to calculate a differentiable error function, which is central to conventional backpropagation.
I'm beginning to think that the magnitude of the win or loss may not be needed to adjust the weights. The most important information is the polarity (win versus lose) of the result which tells us whether to increase or decrease the weights that contribute to that result. Perhaps we can do without the calculation and multiplication of derivatives altogether and still achieve learning. We could increase or decrease weights by small increments simply based on whether the last move by the AI resulted in a win or a loss.
I have only begun to consider the details of this approach, but it gives me a way to move forward without a smooth, differentiable error function.
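For concreteness, the increment-by-polarity idea could be sketched roughly like this (Python used purely for illustration; the function and variable names are made up, and a real net would apply the nudge to every weight on the path that produced the last move):

```python
# Illustrative sketch of derivative-free, polarity-only learning:
# after each game, nudge the weights that contributed to the last move
# by a fixed small step, signed by the result (win = +1, loss = -1).
STEP = 0.01  # arbitrary small increment

def update_weights(weights, inputs, result_sign):
    """result_sign is +1 after a win, -1 after a loss."""
    for i, x in enumerate(inputs):
        # Only weights whose input was active contributed to the move,
        # so only those are nudged, in the direction set by the result.
        weights[i] += result_sign * STEP * x
    return weights

w = [0.5, -0.2, 0.1]
x = [1, 0, 1]               # inputs active for the move being rewarded
update_weights(w, x, +1)    # a win: active weights nudged upward
```

No error function or derivative appears anywhere; whether such a fixed-step rule converges is exactly the open question raised above.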
Have a good holiday.
Captain's Mistress

 Posts: 361
 Joined: Tue Jun 21, 2022 2:17 pm
 Location: South coast UK
 Has thanked: 197 times
 Been thanked: 115 times
Re: Captain's Mistress
I'll try to address some of your comments.
The issue about ReLU remains to be seen. That's part of this exercise.
The only critical thing is that the function needs to return two different gradients depending on polarity.
For zero value inputs, is it really quicker to test for zero and then not do the calculation or simply multiply by zero?
Same for backprop: you need to calculate the gradient anyway (which is the time-consuming bit) and then multiply (possibly by zero).
Might give it a try once I've got some fully working code.
A number for the output choice? That's actually quite easy: "softmax". As the outputs do not need to be easily human readable or all add up to 100%, a cut-down version, a "rectified mean square", will do. The position that results in a win gets a high value, a loss gets a low value, then train accordingly.
What those values are, depends on the choice of activation function. I'll be using classic sigmoid for the final output.
Tests on other NNs I've done have shown that 0.2 for bad and 0.8 for good yield the fastest learn rate.
I plan to add 0.5 for "don't care."
As for simply adding small increments to weights depending on win/lose: that's a complete non-starter for several reasons.
First off, which weight and by how much? (only the derivative of the cost function can answer that).
It might work if you had the full 4,531,985,219,092 (see wiki) data samples but I don't see that as an option.
The whole idea of using a NN is to "predict" a winning move for a game board that has never been seen before based on what has already happened.
That's essentially pattern recognition which requires the methodology already explained.
The magnitude of the win is not required, only that it is in a direction different to a loss.
A value IS required to determine the amount needed to adjust the weights.
The potential issue with using d_RMS/d_weight to determine that value is that, for a multi-output RMS, the sum could potentially be zero if some outputs are too high and some are too low. Though I've never encountered that, I don't have a solution for it.
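That cancellation risk is easy to demonstrate numerically (a Python illustration; the values are made up):

```python
# If half the outputs overshoot and half undershoot by the same amount,
# the plain sum of signed errors is zero even though every output is wrong.
targets     = [0.8, 0.2, 0.8, 0.2]
predictions = [0.6, 0.4, 0.6, 0.4]   # each output is off by 0.2

signed_sum  = sum(t - p for t, p in zip(targets, predictions))
squared_sum = sum((t - p) ** 2 for t, p in zip(targets, predictions))

print(signed_sum)   # ~0: the errors cancel
print(squared_sum)  # ~0.16: squaring keeps every error visible
```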
I'm currently working on a very cut-down version to try out methods of saving/recalling weights and training data.
Probably go for something like single digit to seven-segment display with a 1-3-7 configuration.

Re: Captain's Mistress
I've reconsidered the structure in order to simplify and reduce the number of calculations required in both directions.
Originally, I planned to include move count, and the filled count of each row and column as I thought that might provide information about strategy over and above a simple representation of the current state of play. It was also based on (at the time) not having "option base 1" which left some spare dimensions to the arrays.
I'll continue to size the arrays base 0 with 1 to N being the weight data and element 0 holding the bias value. That way permits a nested for/next from 0 to N which will include the bias as an input instead of treating it as a special case. Training data will be held as individual ASCII files in a human readable format so they can be easily edited with a text editor for test/debug purposes. A custom GUI can come later if/when required.
The files will look like this: (the example is made up and may well be invalid)
0000000
0012010
1121021
2111211
1211211
2121212
ABCDEFG
Essentially a 7x7 grid, with player 1 represented by a 1 and player 2 by a 2. Zero is an unoccupied space.
The bottom row is 7 single-digit integer values that represent the move score of each column for the next player's move.
These will have a value between 1 and 9 and will be divided by 10 when read into the net.
Player 2 (represented by a 2) will be converted to -1; 2 is only used to make a more readable file.
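A minimal parser for that file layout might look like this (Python for illustration; it assumes six board rows followed by one score row, and maps player 2 to -1 — a common convention for two-player inputs. The sample score digits are invented):

```python
def parse_training_file(text):
    lines = text.strip().splitlines()
    board_rows, score_row = lines[:-1], lines[-1]
    # '2' marks player 2 in the file for readability; feed the net -1.
    cell = {'0': 0.0, '1': 1.0, '2': -1.0}
    board = [[cell[c] for c in row] for row in board_rows]
    # Scores are single digits 1..9, scaled to 0.1..0.9 for the net.
    scores = [int(c) / 10.0 for c in score_row]
    return board, scores

sample = """0000000
0012010
1121021
2111211
1211211
2121212
5282858"""
board, scores = parse_training_file(sample)
# board is 6 rows x 7 columns; scores holds the 7 column targets.
```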
As described earlier, a value of 0.2 is bad, 0.5 don't care and 0.8 is good. These appear to be values which result in quickest learning.
Those values are determined by playing against the current hardcoded AI.
After each win/lose a file is created giving the state of the board and the score of the last column played as described above.
During training, the file will be read in and the net trained to give the required column number for the pattern seen.
Training does not need to be exact. Anything above 0.5 will be deemed the column to play. Below 0.5, avoid it.
If all are close to 0.5, either pick the highest or one at random. Whatever method, the highest score will be played.
The loss/cost function will be a simple root mean square, as it is the simplest function which can produce a derivative in order to calculate each weight's contribution to the error, and thus give a value by which it needs to be updated using the basic gradient descent method.
Loss = [(target1 - prediction1) + (target2 - prediction2) + ..... (target7 - prediction7)]^2
The 2 left over after calculating the derivative simply gets lost in the learn rate.
This is then backpropagated through each function's derivative, ending up as a delta to be subtracted from the contributing weight.
The rate of change is then simply the sum of all of the outputs compared to their targets. (as mentioned earlier, this could result as zero but that's very unlikely.)
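As a sketch (Python, illustrative only), the loss exactly as written above and its derivative with respect to each output would be:

```python
def loss(targets, predictions):
    # Square of the SUMMED signed differences, as defined above
    # (not a sum of squares).
    return sum(t - p for t, p in zip(targets, predictions)) ** 2

def dloss_dprediction(targets, predictions):
    # d/dp_k [(sum_i(t_i - p_i))^2] = -2 * sum_i(t_i - p_i):
    # every output receives the SAME delta; the factor of 2 is
    # absorbed into the learn rate as noted above.
    s = sum(t - p for t, p in zip(targets, predictions))
    return [-2.0 * s for _ in predictions]

t = [0.8, 0.2, 0.8]
p = [0.5, 0.5, 0.5]
L = loss(t, p)               # ~0.09
g = dloss_dprediction(t, p)  # each entry ~ -0.6
```

Note that because the differences are summed before squaring, every output shares one delta, which is where the cancellation concern comes from.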
Still to determine the best training method: stochastic, batch or mini-batch.
Given the limitations of even an ESP32-S3, I think a sub-mini-batch is the only viable option (currently planned at batch(10)).
The final decision will have to wait until after a few trials to see how much a new pattern affects the result of a previous one.
If it's not dramatic, then single file might even be an option, bearing in mind that teaching only needs to go as far as making the right choice and not getting all outputs to within a close tolerance of the specified training values.
Re: Captain's Mistress
I believe that the Mean Squared Error is the sum of squared distances between our target variable and predicted values which would be:
Loss = [(target1 - prediction1)^2 + (target2 - prediction2)^2 + ..... (target7 - prediction7)^2]/7.
The division by seven is unnecessary, so it is just the sum of the squared errors.
Taking the square root would be unnecessary, too.
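A sketch of that sum-of-squared-errors form and its per-output derivative (Python, illustrative only):

```python
def sse(targets, predictions):
    # Sum of squared differences; the /7 and the square root are
    # omitted as discussed above.
    return sum((t - p) ** 2 for t, p in zip(targets, predictions))

def dsse_dprediction(targets, predictions):
    # d/dp_k sum_i(t_i - p_i)^2 = -2 * (t_k - p_k): each output keeps
    # its OWN error signal, so opposite errors cannot cancel.
    return [-2.0 * (t - p) for t, p in zip(targets, predictions)]

t = [0.8, 0.2]
p = [0.5, 0.5]              # one output too low, one too high
L = sse(t, p)               # ~0.18
g = dsse_dprediction(t, p)  # ~[-0.6, 0.6]
```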
Re: Captain's Mistress
It seemed like I was getting the same distribution of the initial weights with each run of the program.
I added X = rnd(0 - millis) at the start of Initialize_NN, and it looks like that problem is resolved.

Re: Captain's Mistress
A few things where I beg to differ.
max(x,y) is a function (I don't think it's available in Annex) that returns the larger of x and y.
So max(0.1*sum, sum) will always return sum.
For ReLU it needs to return only positive values. Hence max(0,sum)
For leaky ReLU, the overall goal is to return either the positive value or a small 'leak' when negative.
Either way, in Annex that will be done with a simple if/then statement to return either a 1 for positive gradient or small negative gradient.
So I'm quite sure it should be max(0.01,sum) or probably better max(0.01,sum).
A touch moot as I'll be using sigmoid anyway, at least until something works!
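For reference, leaky ReLU and its two-gradient derivative written with a plain if/else, which is how it would translate to an if/then in Annex (Python sketch; the 0.01 leak factor is the conventional choice, while the thread debates the exact max() form):

```python
def leaky_relu(x, leak=0.01):
    # Equivalent to max(leak * x, x) for 0 < leak < 1: for negative x,
    # leak * x is the larger (less negative) of the two.
    return x if x > 0 else leak * x

def leaky_relu_grad(x, leak=0.01):
    # Two different gradients depending on polarity, as required.
    return 1.0 if x > 0 else leak

print(leaky_relu(2.0))        # 2.0
print(leaky_relu(-2.0))       # -0.02
print(leaky_relu_grad(-2.0))  # 0.01
```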
For the loss function, if the inner terms are squared as well, the error would effectively be ^4.
When differentiated, that would return a cube which is not correct. However, it will be interesting to play with variants of the cost function which again is a part of this project. Also, as you are aware, the other bit (1/n) gets lost in the learn rate.
The distribution of weights is deliberately made to give the same values each time so that we always start from the same place.
Experience has shown that NN learning is very largely affected by the initial weights, so if we are trying different methods, we need the same starting point. We also need to try different seed values to ensure that successful training was not a fluke of the initial weights.
The last thing we want is a "random" in the equation when doing tests, so the weights will always use the same 'random' sequence.
This will be removed once everything else checks out OK.
If you want a different sequence each time, then simply comment out line 30. [rnd(negative number) gives fixed sequence]
Hope to get some more time on this project soon.
Once I've got some code that can play, generate data, save data and train, then the fun will really start. Playing with different methods that we've already discussed.
Keep the ideas coming. I really have no idea how this will all turn out.
Re: Captain's Mistress
I'm quite sure that for negative values of sum, 0.1*sum is more positive than sum,
and 0.1*sum would be selected by max(0.1*sum, sum).
Also, I did not suggest squaring both the inner error differences and the sum.
I proposed to only square the inner differences. That will keep a positive difference
in one term from cancelling out the effect of a negative difference in another term.
I didn't notice line 30 before you mentioned it.
I'm just trying to draw your attention to things that may interfere with correct performance of the neural net.

Re: Captain's Mistress
I'm more than happy for you (or anyone) to point things out. Be they right or wrong, it makes me think. So far you've been mainly right.
I'm pleased that you've taken the time to look so closely.
Now that I've looked closer, I think you may be right about max(0.1*sum,sum) returning 0.1*sum for negatives.
Bit of a brain freeze, confusing magnitude and sign.
Not sure if others would find it useful, but max(x,y) might be a useful function to include in Annex? I might ask later.
Meanwhile, if/then will do nicely.
Any thoughts about the other stuff, i.e. the graphics etc.?
There are a fair few gotos lurking in there which I'm not overly happy about.
That part was written before elseif was implemented and there was a bug with exit loop.
It could all do with making a little neater.
Re: Captain's Mistress
Remember when I wrote:
"I'm beginning to think that the magnitude of the win or loss may not be needed to adjust the weights. The most important information is the polarity (win versus lose) of the result which tells us whether to increase or decrease the weights that contribute to that result. Perhaps we can do without the calculation and multiplication of derivatives altogether and still achieve learning. We could increase or decrease weights by small increments simply based on whether the last move by the AI resulted in a win or a loss." ?
Well, I have implemented a first experiment with that method. It actually does seem to be learning to play a little bit better game.
I have mostly been having AImoveNN for player 1 play against AImoveR for player 0 set to random with no central bias. So the neural net is playing "Beat the Monkey" with the monkey being totally random in its move selection among columns that are not full. I set the random number generator seed to the same value, and then I run hundreds of games with no backpropagation and compare it with results of about the same number of games with backpropagation. My number for comparison is the ratio of player 1 wins divided by player 0 wins.
After about 500 games, AImoveNN, even without backpropagation, beats the monkey by a ratio around 3 or 4 to 1. With backpropagation AImoveNN wins with a ratio that trends up to 11 or 12 to 1.
When AImoveNN is tested against AImoveTL for about 600 games, it loses to AImoveTL with a ratio of about 0.095 with no backpropagation.
It improves its ratio with backpropagation to about 0.22.
The speed of play is around 25 games per minute, so 600 games is nearly 1/2 hour of running.
I had a leaky ReLU with a slope for negative values of 0.2.
No sigmoid, no derivatives, no sum of squares.
No monkeys were harmed in this experiment.
Disclaimer: Your monkey's randomness may vary.
"I'm beginning to think that the magnitude of the win or loss may not be needed to adjust the weights. The most important information is the polarity (win versus lose) of the result which tells us whether to increase or decrease the weights that contribute to that result. Perhaps we can do without the calculation and multiplication of derivatives altogether and still achieve learning. We could increase or decrease weights by small increments simply based on whether the last move by the AI resulted in a win or a loss." ?
Well, I have implemented a first experiment with that method. It actually does seem to be learning to play a little bit better game.
I have mostly been having AImoveNN for player 1 play against AImoveR for player 0 set to random with no central bias. So the neural net is playing "Beat the Monkey" with the monkey being totally random in its move selection among columns that are not full. I set the random number generator seed to the same value, and then I run hundreds of games with no backpropagation and compare it with results of about the same number of games with backpropagation. My number for comparison is the ratio of player 1 wins divided by player 0 wins.
After about 500 games, AImoveNN, even without backpropagation, beats the monkey by a ratio around 3 or 4 to 1. With backpropagation AImoveNN wins with a ratio that trends up to 11 or 12 to 1.
When AImoveNN is tested against AImoveTL for about 600 games, it loses to AImoveTL with a ratio of about 0.095 with no backpropagation.
It improves its ratio with backpropagation to about 0.22 .
The speed of play is around 25 games per minute, so 600 games is nearly 1/2 hour of running.
I had a leaky ReLU with a slope for negative values of 0.2 .
No sigmoid, no derivatives, no sum of squares.
No monkeys were harmed in this experiment.
Disclaimer: Your monkey's randomness may vary.

Re: Captain's Mistress
My monkeys have all gone out on strike.
I've been trying something very similar but with a very reduced number of neurons just to try and prove some points and have it run quicker.
Only 2 inputs, 3 hidden and the mandatory 7 outputs, using sigmoid as the activation.
Rather than have it play games, I've been trying to get it to learn a single state of play and make the correct move.
Without polarity in the loss, it eventually saturates either towards 0 or 1 depending on the initial seed.
With polarity, it starts off OK but eventually sets all the outputs to the AVERAGE of the error.
I had this happen to something similar many years ago but cannot recall the reason. I think it was an error in calculating the rate.
I'll have to go back to my XOR solution to try and find out.
There's only so much you can spew out with wlog, but it's a great way of checking what's going on with the weights.
I've probably got the indexing wrong in the nested loops and updating a neighbour neuron which would make it look like things move in the right direction but will ultimately give the wrong result.
I also think the player's turn needs to be part of the equation; otherwise, when looking at a state of play, it is not possible to tell whose turn it is.
Does that even make a difference?
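One common way to sidestep a separate turn input (an assumption on my part, not something from this thread) is to always encode the board from the perspective of the player about to move, so the net never needs to know whose turn it is:

```python
def encode_for_mover(board, mover):
    # Cells hold 0 (empty), 1 (player 1) or 2 (player 2). The mover's
    # own pieces always become +1 and the opponent's -1, so the same
    # net can play either side without a turn indicator.
    def cell(c):
        if c == 0:
            return 0.0
        return 1.0 if c == mover else -1.0
    return [[cell(c) for c in row] for row in board]

row = [0, 1, 2, 1]
print(encode_for_mover([row], mover=1))  # [[0.0, 1.0, -1.0, 1.0]]
print(encode_for_mover([row], mover=2))  # [[0.0, -1.0, 1.0, -1.0]]
```

With this encoding, two boards that differ only in whose turn it is produce different inputs, which answers the "does it make a difference?" question in the affirmative.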
Thanks for your efforts. Looks like you're progressing quicker than I am.
Maybe we should have a playoff. Your NN vs mine. Especially if we use different methods. Maybe a sigmoid vs ReLU?
What do you think?
Have cicciocb host it on his hardware to ensure no cheating. (I'm sure he'd be up for that)