How do you deal with a breakdown in theoretical support in machine learning? Researchers from Carnegie Mellon and Facebook describe beating the world's top poker players by inventing smart search strategies to compensate for the lack of theoretical guarantees that underpin most game-playing AI.
Among the many achievements of machine learning in recent years, some of the most striking are the victories of the machine against human players in games, such as Google's DeepMind group's conquest of Go in 2016. In such milestones, researchers are often guided by theoretical math that says an optimal strategy exists and can be found, given a good algorithm and enough compute.
But what do you do when theory breaks down? Two researchers at Carnegie Mellon University and Facebook went back to the drawing board to take on six-player no-limit Texas hold'em, the most popular form of multiplayer poker in the world.
The theoretically optimal strategy isn't computable for this form of the card game, so they designed some elegant search strategies for their computer program, "Pluribus," to beat the best human players over 10,000 hands of poker. The authors even managed to train it on a single, 64-core Intel-based server with less than 512 gigabytes of RAM, which they point out is far less compute than that consumed by increasingly gigantic machine learning efforts such as DeepMind's "AlphaZero."
Rather than computing optimal solutions across players, the Pluribus program searches for good enough solutions that turn out to perform surprisingly well.
The work, "Superhuman AI for multiplayer poker," which describes twelve days of competition against top poker players, is published today in Science magazine and is written by Noam Brown and Tuomas Sandholm. Both authors are affiliated with Carnegie Mellon University; Brown is also with Facebook AI Research, and Sandholm with three Pittsburgh companies, Strategic Machine, Inc., Strategy Robot, Inc., and Optimized Markets, Inc.
Science magazine has become something of a hotbed for cutting-edge poker papers by machine learning types, and this is the second appearance by Brown and Sandholm in a little over a year. In January of last year, they published a program called "Libratus" that could achieve "superhuman" ability in the two-player version of Texas hold'em poker.
Brown and Sandholm’s real-time search strategy for Pluribus in the thick of Texas hold’em.
With Pluribus, the authors take on a new level of complexity that comes with multiple opponents; in this case, five humans against the Pluribus machine. In most games taken on by machine learning, including Go and two-player poker, there is a theoretical framework that forms the basis for finding optimal playing strategies. The "Nash Equilibrium," named for famed US mathematician John Nash, describes a set of strategies, one per player, such that no player can improve their outcome by unilaterally changing their own strategy while the others keep theirs fixed.
In a simple game like rock, paper, scissors, the equilibrium strategy is to pick each of the three moves at random with equal probability; by contrast, playing the same choice every round, such as rock, could be ruthlessly exploited by any opponent who notices the pattern.
So making bots that play games can in some sense be boiled down to building a machine that computes the Nash Equilibrium.
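To make the rock-paper-scissors example concrete, here is a minimal sketch (not from the paper) verifying that the uniform mix is an equilibrium: against an opponent playing each move with probability 1/3, every pure response earns the same expected payoff, zero, so no deviation helps.

```python
# Rock-paper-scissors payoff to the row player: +1 win, -1 loss, 0 tie.
# Rows and columns are both ordered (rock, paper, scissors).
PAYOFF = [
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
]

uniform = [1 / 3, 1 / 3, 1 / 3]  # the claimed equilibrium mix

# Expected payoff of each pure response against the uniform mix.
response_values = [
    sum(p * PAYOFF[action][opp] for opp, p in enumerate(uniform))
    for action in range(3)
]
print(response_values)  # [0.0, 0.0, 0.0]: no pure deviation is profitable
```

Since every response earns the same payoff against the uniform mix, a player gains nothing by deviating from it, which is exactly the equilibrium property the bot-builder wants to compute.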
The problem is that as games increase in complexity, finding the Nash Equilibrium becomes more and more computationally intense. Approximating that equilibrium is the best computers can do within practical time limits. The approach has worked well in a number of games, and in two-player, heads-up poker in particular, it served Brown and Sandholm well with Libratus, as it did another team, Moravčík and colleagues at the University of Alberta, who published their "DeepStack" machine for Texas hold'em in Science in 2017.
But in multi-player Texas hold’em poker, the Nash Equilibrium becomes intractable computationally. As the authors write, “Even approximating a Nash equilibrium is hard (except in special cases) in theory, and in games with more than two players, even the best complete algorithm can only address games with a handful of possible strategies per player.”
So, Brown and Sandholm had to make do with an approach to machine learning that is "not guaranteed to converge to a Nash equilibrium." It's a venture into the unknown in a sense, but one that wins nevertheless: "[E]ven though the techniques do not have known strong theoretical guarantees on performance outside of the two-player zero-sum setting, they are nevertheless capable of producing superhuman strategies in a wider class of strategic settings."
Pluribus uses an approach familiar from Libratus and DeepStack to train the machine, something called "counterfactual regret minimization," or CFR. In the context of a round of poker, where the actions are calling, raising, or folding, CFR has the computer play against itself and, at each decision point, analyze in hindsight how much better each alternative action would have been, a quantity known as regret. Training the machine amounts to constructing a "blueprint" of which moves are high-value given any current state of the game. The machine keeps refining this blueprint as it proceeds through branching rounds of moves in game after game.
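A minimal, self-contained sketch of the regret-matching idea at CFR's core, applied here to rock-paper-scissors rather than poker (this is an illustration, not the authors' code): each player accumulates regret for the actions it didn't take and then plays in proportion to positive regret; the average strategy over many rounds of self-play drifts toward the uniform equilibrium.

```python
import random

random.seed(0)
N_ACTIONS = 3  # rock, paper, scissors
# Payoff to a player choosing the row action against the column action.
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def strategy_from(regrets):
    # Regret matching: play in proportion to positive accumulated regret.
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / N_ACTIONS] * N_ACTIONS  # no positive regret: play uniformly

regrets = [[0.0] * N_ACTIONS for _ in range(2)]
strategy_sum = [[0.0] * N_ACTIONS for _ in range(2)]

for _ in range(200_000):  # self-play
    strategies = [strategy_from(regrets[p]) for p in range(2)]
    actions = [random.choices(range(N_ACTIONS), weights=strategies[p])[0]
               for p in range(2)]
    for p in range(2):
        me, opp = actions[p], actions[1 - p]
        for a in range(N_ACTIONS):
            # Counterfactual regret: how much better action a would have been.
            regrets[p][a] += PAYOFF[a][opp] - PAYOFF[me][opp]
            strategy_sum[p][a] += strategies[p][a]

# The *average* strategy over self-play is what converges toward equilibrium.
total = sum(strategy_sum[0])
average = [s / total for s in strategy_sum[0]]
print(average)  # close to [0.333, 0.333, 0.333]
```

Full-scale CFR applies this same accumulate-regret-and-rebalance loop at every decision point of the game tree, which is what produces the "blueprint" described above.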
But where the action happens, and where the theory breaks down, is in live play against humans. Because poker, unlike chess or Go, is a game of "imperfect information" in which opponents' cards are hidden, computing a Nash Equilibrium by search won't work: opponents can shift to different strategies at each move, so the machine can't simply look ahead to fixed outcomes as it might in chess. As the authors put it in technical terms, "in imperfect-information subgames (the part of the game in which search is being conducted), leaf nodes do not have fixed values."
To solve that problem, Brown and Sandholm came up with what they deem a superior search strategy for Pluribus to compute during the game. The machine searches over multiple possible changes in strategy by its opponents, assuming four possibilities: an opponent sticks with the strategy played up to that point, or switches to one of three strategies biased toward calling, raising, or folding.
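The following toy sketch (hypothetical hand-picked numbers, and a deliberate simplification: in the paper, each player selects among the continuation strategies as part of solving the subgame itself) conveys why leaf values are not fixed. Each leaf of the search is scored under each possible opponent continuation strategy, and here the searcher simply plans against the least favorable one.

```python
# Four continuation strategies the opponent might adopt beyond the subgame,
# as named in the text: the blueprint, or one of three biased variants.
CONTINUATIONS = ("blueprint", "call-biased", "raise-biased", "fold-biased")

# Hypothetical leaf values (chips won) for two candidate actions; one entry
# per opponent continuation strategy, since the leaf has no single value.
leaf_values = {
    "check": {"blueprint": 4.0, "call-biased": 3.5,
              "raise-biased": -1.0, "fold-biased": 5.0},
    "bet":   {"blueprint": 6.0, "call-biased": 2.0,
              "raise-biased": 1.5, "fold-biased": 8.0},
}

def robust_value(values_by_continuation):
    # Plan against the least favorable continuation the opponent might pick.
    return min(values_by_continuation.values())

best_action = max(leaf_values, key=lambda a: robust_value(leaf_values[a]))
print(best_action)  # "bet": its worst case (1.5) beats "check"'s (-1.0)
```

The key point the sketch illustrates is that a search that assumed a single fixed value per leaf could be badly wrong the moment an opponent shifts strategy; scoring each leaf under several continuations makes the chosen action robust to that shift.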
They also had the machine compute what the other players might assume Pluribus has in its hand based on the game action up to that point in time. That was a way to make Pluribus vary its strategy to throw opponents off balance. As the authors write, “Regardless of which hand Pluribus is holding, it will first calculate how it would act with every possible hand, being careful to balance its strategy across all the hands to remain unpredictable to the opponent.”
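A small worked illustration of that balancing idea (the hands and betting frequencies below are hypothetical, not from the paper): if a strategy mixes bets across strong and weak hands, then observing a bet leaves an opponent with real uncertainty about the hand. Bayes' rule makes that concrete.

```python
# Hypothetical three-hand "range," each hand equally likely, with a
# hypothetical probability of betting for each: strong hands mostly bet
# for value, weak hands sometimes bet as a bluff.
bet_prob = {"AA": 0.7, "KQ": 0.5, "72": 0.3}

prior = 1 / len(bet_prob)                          # uniform prior over the range
p_bet = sum(prior * p for p in bet_prob.values())  # overall betting frequency

# Posterior P(hand | bet): what an observer can infer from seeing a bet.
posterior = {hand: prior * p / p_bet for hand, p in bet_prob.items()}
print(posterior)  # roughly {'AA': 0.467, 'KQ': 0.333, '72': 0.2}
```

Because the posterior stays spread across the whole range, the bet itself reveals little, which is the sense in which computing a strategy for every possible hand keeps the machine unpredictable.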
The approach to search here differs from the way other outfits are contending with the difficulty of Nash Equilibrium. For example, DeepMind, in crafting the "AlphaStar" machine that took on human players of the strategy video game StarCraft, couldn't rely on the same techniques that its AlphaZero used. In chess and Go, one can approximate the Nash Equilibrium because one can assume two opponents who each optimize their respective strategies while taking into account the other's optimization. But StarCraft is what's called a non-transitive game, meaning that there is no single consistent opponent against which to optimize. The way DeepMind dealt with that was to expand the search "space," if you will, in which to look for best moves, what's known as the "polytope."
Expanding a search space in that fashion ultimately brings more computing demands, and Brown and Sandholm are proud that their work minimizes the actual compute needs. While the training of Pluribus was done on a 64-core server, the live gameplay against humans was performed on a machine with "two Intel Haswell E5-2695 v3 CPUs and uses less than 128 GB of memory."
As they note in an accompanying blog post, training would cost about $150 in cloud computing facilities. "This is in sharp contrast to other recent AI breakthroughs, including those involving self-play in games, which commonly cost millions of dollars to train," they write.
It turns out smart search techniques like these did well in practice, even without converging on a Nash Equilibrium. The authors don't disclose how many of the 10,000 hands Pluribus won against the humans, but in a follow-up email to ZDNet, Noam Brown explained that the important measure is the machine's average winnings.
Pluribus “won convincingly over the long haul,” is how Brown described it.
“Another way to think of the bot’s performance is in terms of what it would have won if it were playing against these human pros for money,” Brown continued. “Playing thousands of hands against some of the world’s best human professionals — each of whom has won at least $1 million playing poker — the bot came out ahead, earning the equivalent of about $1,000 per hour.”