February 06, 2020
Google has developed a way to predict which machine learning models will produce the best results. A team of Google AI researchers proposes what they call "off-policy classification," or OPC, in a newly published paper ("Off-Policy Evaluation by Off-Policy Classification") and blog post. The method evaluates the performance of AI-driven agents by treating evaluation as a classification problem.
The team reports that their approach, a variant of reinforcement learning that uses rewards to drive agent policies toward goals, works with image inputs and scales to tasks such as vision-based robotic grasping.
"Fully off-policy reinforcement learning is a variant in which an agent learns entirely from older data, which is appealing because it enables model iteration without requiring a physical robot," writes Alex Irpan, a software engineer on the Robotics at Google team. "With fully off-policy RL, one can train several models on the same fixed dataset collected by previous agents, then select the best one."
Arriving at OPC was a little more challenging than it might appear. As Irpan and his coauthors note, fully off-policy reinforcement learning enables training an AI model on data collected by, say, a robot, but not evaluating that model; and ground-truth evaluation, they point out, is generally too inefficient.
Their solution, OPC, tackles this by assuming that the tasks at hand involve little to no randomness in how states change, and that agents either succeed or fail at the end of each experimental trial. The binary nature of the second assumption makes it possible to assign each action one of two classification labels: "effective" for successes or "catastrophic" for failures.
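The binary labeling step can be sketched as follows. This is an illustrative reading of the idea, not code from the paper; the `Trajectory` class and `label_transitions` helper are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list          # list of (state, action) pairs from one trial
    succeeded: bool      # did the trial end in success?

def label_transitions(trajectories):
    """Assign every (state, action) pair in a trial one of two class labels,
    based on whether the trial as a whole succeeded or failed."""
    labeled = []
    for traj in trajectories:
        label = "effective" if traj.succeeded else "catastrophic"
        for state, action in traj.steps:
            labeled.append((state, action, label))
    return labeled

# Example: two short trials, one success and one failure.
trials = [
    Trajectory(steps=[("s0", "a0"), ("s1", "a1")], succeeded=True),
    Trajectory(steps=[("s0", "a2")], succeeded=False),
]
print(label_transitions(trials))
```

Because the label comes from the trial's final outcome, no per-step reward signal is needed, which is what makes the labeling workable on fixed, previously collected data.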
In addition, OPC relies on what is called a Q-function (learned with a Q-learning algorithm) to estimate the expected cumulative reward of each action. Agents choose the actions with the greatest expected rewards, and their performance is measured by how often the chosen actions are effective, which depends on how well the Q-function classifies actions as effective versus catastrophic. That classification accuracy serves as the off-policy evaluation score.
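A minimal sketch of that idea: a learned Q-function scores each (state, action) pair, and its accuracy at separating effective from catastrophic actions is the evaluation score. The 0.5 threshold and the toy Q-table below are illustrative assumptions, not details from the paper.

```python
def opc_score(q_function, labeled_transitions, threshold=0.5):
    """Fraction of labeled transitions the Q-function classifies correctly."""
    correct = 0
    for state, action, label in labeled_transitions:
        predicted = "effective" if q_function(state, action) >= threshold else "catastrophic"
        correct += predicted == label
    return correct / len(labeled_transitions)

# Toy Q-function backed by a fixed lookup table.
q_table = {("s0", "a0"): 0.9, ("s1", "a1"): 0.8, ("s0", "a2"): 0.2}
q = lambda s, a: q_table[(s, a)]

data = [
    ("s0", "a0", "effective"),
    ("s1", "a1", "effective"),
    ("s0", "a2", "catastrophic"),
]
print(opc_score(q, data))  # → 1.0
```

The key property is that the score needs only the fixed labeled dataset and the Q-function itself, so no new robot rollouts are required to compare candidate policies.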
The team trained machine learning policies fully off-policy in simulation and then evaluated them using off-policy scores computed from previous real-world data. In a robotic grasping task, they report, one variant of OPC, called SoftOPC, performed best at predicting success rates. Across 15 models of varying robustness (seven of which were trained purely in simulation), SoftOPC produced scores closely correlated with true grasp performance and "considerably" more reliable than baseline procedures.
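The model-selection step described above amounts to scoring each candidate policy on the same fixed dataset and picking the highest-scoring one, then checking how well the scores track real performance. The scores and success rates below are made-up illustrative numbers, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical off-policy scores and real-world grasp success rates
# for five candidate models.
opc_scores   = [0.55, 0.62, 0.71, 0.80, 0.88]
true_success = [0.40, 0.48, 0.60, 0.72, 0.81]

# Select the model with the highest off-policy score.
best = max(range(len(opc_scores)), key=lambda i: opc_scores[i])
print("selected model:", best)  # → 4

# A high correlation means the score is a useful proxy for real performance.
print("correlation:", pearson(opc_scores, true_success))
```

This is the practical payoff the researchers describe: if the off-policy score correlates strongly with real-world success, the best model can be chosen without running every candidate on a physical robot.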
The researchers intend to explore tasks with "noisier" and nonbinary dynamics in future work. "[W]e think the results are sufficiently promising to be extended to a lot of real-world RL problems," Irpan wrote.