What is reinforcement learning?

We look at a method of AI development built on the idea of positive and negative feedback

Among the most fascinating subdivisions of artificial intelligence is reinforcement learning. Itself a subset of machine learning (ML), reinforcement learning is widely tested on games, such as Go, but its development could have wider implications for industries and businesses.

This branch of AI aspires to reflect human-like capabilities and has even exceeded these ambitions when applied in gaming contexts. For instance, it’s gone toe-to-toe with several world champions in their specialities. 

Ke Jie is a recent example of a Go world champion who has been humbled by a reinforcement learning system. The Chinese competitor had dominated the game since 2014, but he was beaten three times in 2017 by a system developed by Google’s DeepMind division.

In the previous year, DeepMind’s AlphaGo system lost to the 18-time Go champion Lee Sedol in the fourth of a five-game series, although it won the other four games. Lee then retired in 2019, citing the dominance of AI and suggesting it “cannot be defeated”. 

Although reinforcement learning has proven itself in the realm of gaming, the technology can also be used in robotics and automation. Further breakthroughs could therefore have significant implications for businesses and the wider economy.

What is RL?

Reinforcement learning (RL) is a method of training ML systems to find their own way of solving complex problems, rather than making decisions based on preconfigured possibilities that a programmer has set. Both positive and negative reinforcement are used: correct decisions earn rewards, whereas incorrect decisions are penalised. Although humans normally think of a reward as a treat of some description, for a machine the reward is simply a positive evaluation of an action.
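To make the idea of a reward as "a positive evaluation of an action" concrete, here is a minimal sketch. The function name, the outcome labels and the numeric values are all assumptions chosen for illustration; real systems define their own reward schemes.

```python
# Hypothetical reward scheme: the labels and values below are illustrative assumptions.
def reward(outcome: str) -> float:
    """Turn the outcome of an action into a numeric evaluation."""
    rewards = {"win": 1.0, "loss": -1.0}
    # Anything that is neither a win nor a loss is treated as neutral.
    return rewards.get(outcome, 0.0)

print(reward("win"), reward("loss"), reward("draw"))
```

The agent never sees a treat or a telling-off, only a number that is higher for decisions worth repeating and lower for decisions worth avoiding.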

RL also doesn't rely on human involvement during the training process. In classic ML, using what's known as supervised learning, an algorithm is given a set of decisions to choose from. Using the game of Go as an example, someone training the algorithm could give it a list of moves to make in a given scenario, from which the program could then choose. The problem with this model is that the algorithm can only ever be as good as the human programming it, which means the machine cannot learn by itself.

The goal of reinforcement learning is to train the algorithm to make sequential decisions that reach an end goal. Over time, through reinforcement, the algorithm learns how to make the decisions that reach that goal in the most efficient way. When trained using reinforcement learning, artificial intelligence systems can draw experience from many more decision trees than humans can, which makes them better at solving complex tasks, at least in gamified environments.
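One common way to express "the most efficient way" in numbers is to discount rewards that arrive later, so that reaching the goal in fewer steps scores higher. The article doesn't specify this mechanism, so treat the sketch below as an assumption: the function name, the discount factor of 0.9 and the two example episodes are all hypothetical.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum a sequence of rewards, shrinking each later step by another factor of gamma."""
    total = 0.0
    for step, r in enumerate(rewards):
        total += (gamma ** step) * r
    return total

# Two hypothetical episodes that both end with a reward of 1 at the goal.
short_path = [0.0, 0.0, 1.0]             # goal reached on the third decision
long_path = [0.0, 0.0, 0.0, 0.0, 1.0]    # goal reached on the fifth decision

print(discounted_return(short_path))
print(discounted_return(long_path))
```

Both episodes reach the same goal, but the shorter one earns the larger total, which gives the algorithm a reason to prefer the quicker route.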

Learning to win

Reinforcement learning shares many similarities with supervised learning in a classroom. A framework establishing the ground rules is still required, but the software agent is never told what instructions it should follow, nor is it given a database to draw upon. This approach allows a system to create its own dataset from its actions, built up through trial and error, to establish the most efficient route to a reward.

This is all done sequentially: a software agent will take one action at a time until it encounters a state for which it is penalised. For example, a virtual car leaving a road or track will produce an error state, and the agent will be reset to its starting position. For many processes, we don't actually need the system to learn to make new decisions as it develops, but rather just to refine its data processing capabilities, as is the case with facial recognition technology. For some, however, reinforcement learning is by far the most beneficial form of development.
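The one-action-at-a-time loop, with a reset after an error state, can be sketched with a toy version of the virtual car example. Everything here is an assumption for illustration: the TrackEnv class, its track dimensions, the reward values and the random "untrained" agent are hypothetical, and a real simulator would be far richer.

```python
import random

class TrackEnv:
    """Toy track: the car steers left or right and must stay within the lane."""

    def __init__(self, half_width=2, length=10):
        self.half_width = half_width
        self.length = length
        self.reset()

    def reset(self):
        self.offset = 0      # lateral position, 0 is the centre line
        self.distance = 0    # progress along the track
        return (self.offset, self.distance)

    def step(self, action):
        """action is -1 (steer left), 0 (straight) or +1 (steer right)."""
        self.offset += action
        self.distance += 1
        state = (self.offset, self.distance)
        if abs(self.offset) > self.half_width:
            return state, -1.0, True    # left the track: penalised error state
        if self.distance >= self.length:
            return state, 1.0, True     # reached the end of the track: rewarded
        return state, 0.0, False

def run_episode(env, rng):
    state = env.reset()                 # every attempt starts from the same position
    done = False
    steps = 0
    while not done:
        action = rng.choice([-1, 0, 1])     # untrained agent: pure trial and error
        state, reward, done = env.step(action)
        steps += 1
    return reward, steps

rng = random.Random(0)
env = TrackEnv()
results = [run_episode(env, rng) for _ in range(5)]
print(results)
```

Each entry in results pairs the final reward with the number of actions taken before the episode ended. A learning algorithm would use that feedback to change how actions are chosen, rather than picking them at random.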

One of the most famous examples comes from Google's DeepMind, which used a Deep Q-Learning algorithm to master Atari Breakout, the classic 1970s arcade game in which players smash through eight rows of blocks with a ball and paddle. During its development, the software agent was only provided with the information that appeared on screen and was tasked with simply maximising its score.

As you might expect, the agent struggled to get to grips with the game early on. Researchers found it was unable to grasp the controls and consistently missed the ball with the paddle. After a great deal of trial and error, the agent eventually figured out that if it angled the ball so that it became stuck between the highest layer and the top wall, it could break down the majority of the wall with only a small number of paddle hits. Not only that, it was able to understand that each time the ball travelled back to the paddle, the efficiency of the run dropped, and the length of the game increased.
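The trial-and-error learning described above can be illustrated with plain Q-learning, a simpler relative of Deep Q-Learning that stores its estimates in a lookup table instead of a neural network. This is not DeepMind's method or code: the five-cell corridor standing in for Breakout, the function name and the hyperparameter values are all assumptions made for the sake of a short, runnable sketch.

```python
import random

N_STATES = 5            # corridor cells 0..4; reaching cell 4 is the goal
ACTIONS = [-1, +1]      # index 0 moves left, index 1 moves right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # illustrative hyperparameters

def train(episodes=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action_index]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Epsilon-greedy: usually exploit the best-known action, sometimes explore.
            if rng.random() < EPSILON or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
            reward = 1.0 if next_state == N_STATES - 1 else 0.0
            # Q-learning update: nudge the estimate toward the reward plus
            # the discounted value of the best action in the next state.
            target = reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q

q_table = train()
greedy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
print(greedy)
```

Early episodes are a random walk, much like the agent that kept missing the ball. Once the goal has been found a few times, the reward propagates back along the corridor and the preferred action in every cell becomes "right".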

The agent based its decisions on a policy network. Every action taken by the agent was recorded by the network, which also noted the result and what could be done differently to change that result. The result, also known as a state, can therefore be predicted by the agent.
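One way to picture this record-keeping is as a log of transitions, each noting the state, the action taken, the reward and the state that followed. The sketch below is an assumption about how such a log might look, loosely modelled on the "experience replay" idea used alongside Deep Q-Learning. The class, field names and capacity are hypothetical, not DeepMind's implementation.

```python
import random
from collections import deque, namedtuple

# One recorded step: what the agent saw, what it did, and what happened next.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity):
        # A bounded deque silently drops the oldest record once it is full.
        self.memory = deque(maxlen=capacity)

    def record(self, *fields):
        self.memory.append(Transition(*fields))

    def sample(self, batch_size, rng=random):
        # Draw past experiences at random so learning is not tied to the latest steps.
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.record(t, "right", 0.0, t + 1, False)

batch = buffer.sample(2, random.Random(0))
print(len(buffer), [tr.state for tr in buffer.memory])
```

With a capacity of three, only the three most recent transitions survive the five recorded steps, so the size of the log decides how much past experience is still available to learn from.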

Problems with reinforcement learning

The example above is useful for understanding the fundamental principles of reinforcement learning, but gaming environments, no matter how large, only offer limited scope for learning and rarely offer anything meaningful beyond simple testing.

Success in a game does not always translate easily into real-world use cases, particularly as reinforcement learning relies on a system of reward and failure states that are often ambiguous in reality. Tasking an agent with solving a particular challenge within tight parameters is one thing, but creating a realistic simulation that's applicable for everyday use is far harder.

If we take the example of an autonomous vehicle system, creating a simulation for it to learn from can be incredibly challenging. Not only does the simulation need to accurately represent a real-world road, and convey the various laws and restrictions that govern car use, but it also needs to take into account constant changes in traffic volume, the sudden actions of other human drivers (who may not be obeying the highway code themselves), and random obstacles.

There are also a variety of technical challenges that limit the potential of this type of learning. There are examples of systems 'forgetting' older actions, results and predictions when new knowledge is acquired. There have also been problems with agents successfully achieving a desired positive state, but doing so in an inefficient or undesired way. For example, in 2018 Deepsense.ai sought to teach an algorithm to run, but found that the agent developed a tendency to jump instead, because jumping got it to its future positive state far more quickly.

The future of machine learning?

Gaming environments, no matter how large they are, offer a limited scale for machine learning and are really only useful for testing. In the real world, there is a range of applications that RL could potentially revolutionise, but doing so would require agents to learn in far more complicated environments. So, while it could accelerate automated software for robotics and factory machines, web system configurations, or even medical diagnosis, it might be some time before any real progress is made.

We are still some way off a machine being able to learn like a human, and reinforcement learning is not an easy technology to implement. But with time, it could be the driving force of future technology.
