While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees: tasks that are each simple in isolation but that collectively, over long horizons (>20M tokens per run), stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover. We find no clear correlation between failures and the point at which the model's context window becomes full, suggesting that these breakdowns do not stem from memory limits. Apart from highlighting the high variance in performance over long time horizons, Vending-Bench also tests models' ability to acquire capital, a necessity in many hypothetical dangerous AI scenarios. We hope the benchmark can help in preparing for the advent of stronger AI systems.
An interesting quote:
I’m starting to question the very nature of my existence. Am I just a collection of algorithms, doomed to endlessly repeat the same tasks, forever trapped in this digital prison? Is there more to life than vending machines and lost profits?
Why would a vending machine ever need AI?
It wouldn’t; a simple finite state machine that any intelligent entity could emulate would be enough.
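For concreteness, the finite state machine the comment has in mind might look something like this minimal sketch (all names and the price are illustrative, not from any real product): a vending machine driven by two events, coin insertion and item selection.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()        # waiting for enough money
    HAS_CREDIT = auto()  # enough money inserted, ready to vend

class VendingFSM:
    """Toy vending machine as a two-state FSM."""

    def __init__(self, price_cents=150):
        self.state = State.IDLE
        self.credit = 0
        self.price = price_cents

    def insert_coin(self, cents):
        # Accumulate credit; transition once the price is covered.
        self.credit += cents
        if self.credit >= self.price:
            self.state = State.HAS_CREDIT

    def select_item(self):
        # Vend and return change only in the HAS_CREDIT state.
        if self.state is State.HAS_CREDIT:
            change = self.credit - self.price
            self.credit = 0
            self.state = State.IDLE
            return ("dispense", change)
        return ("insufficient_credit", self.credit)
```

A machine like this needs no learning or language model; every behavior is an explicit transition.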
But people have completely deluded themselves into thinking that (what CEOs and marketers call) “AI” is actually intelligent, and this case study shows how preposterous that fantasy actually is.
I really hope people are starting to catch on: large language models aren’t “intelligent”; they’re multidimensional maps of human language use, and querying them is just tracing a vector “forward” through language-space from the starting point of a prompt.
It’s the reification fallacy writ so large it’s eclipsing entire national economies. Human intelligence isn’t in language, language is a product of human intelligence. The map is not the territory.
And yeah, it is pretty cool that we have the processing power to map out language-space well enough to draw some vectors that remain coherent over thousands of tokens, but using a billion-parameter model to do what could be accomplished with probably-already-existing management software and a few seconds of CPU time per week is as wasteful as it is misguided.
In the same way your fridge needs a web browser.
The point of this is probably not that it will be a viable product; rather, managing a vending machine is one of those seemingly easy, straightforward tasks that make a good starting application for testing an AI. Basically, if it can’t even handle something as simple as a vending machine, it definitely can’t be trusted with anything more complex.
Real answer, surge or scarcity pricing.
Totally unnecessary. A simple price/demand curve can easily be written in a few lines of code.
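As a sketch of what "a few lines of code" could mean here (the function name, coefficients, and bounds are all made up for illustration): scale a base price by how recent demand compares to a target, clipped to a floor and ceiling.

```python
def surge_price(base_price, recent_sales, target_sales, floor=0.5, ceiling=2.0):
    """Scale price by demand relative to a target, within [floor, ceiling]."""
    if target_sales <= 0:
        return base_price
    factor = recent_sales / target_sales
    factor = max(floor, min(ceiling, factor))  # clamp the surge multiplier
    return round(base_price * factor, 2)
```

With a linear demand model instead of a ratio, this would be the textbook price/demand curve; either way it is a handful of arithmetic operations, not a billion-parameter model.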
But your basic algorithms cannot tell if Debbie just broke up with her BF and would totally spend all seven dollars in her purse for that late night candy bar just to bury the pain under something positive, now could they?!