TL;DR
- Gemini 3.1 Pro scores 77.1% on the ARC-AGI-2 reasoning benchmark.
- The model maintains a 1 million token context window and expands its output capacity to 65,000 tokens.
- A new custom tools endpoint enhances file operations and coding agents.
- A preview of the model is now available across the Gemini app, Vertex AI, and developer tools.
Google has launched Gemini 3.1 Pro, an updated model engineered to improve complex reasoning, planning, and tool use across both consumer and enterprise services. The company stated that the new model more than doubles the ARC-AGI-2 score achieved by Gemini 3 Pro, a gain it says reflects stronger performance on problem-solving tasks rather than simple text generation. The update is currently rolling out to the Gemini app, Vertex AI, NotebookLM, and various developer tools.
Gemini 3.1 Pro has achieved a verified score of 77.1% on the ARC-AGI-2 benchmark. This benchmark assesses a model’s capability to reason through novel logic patterns that are not present in its training data. Google indicated that this advancement is beneficial for agent-driven workloads, which rely on consistent long-form reasoning across multiple task steps.
> Today, we’re continuing to push the boundaries of AI with our release of Gemini 3.1 Pro.
>
> This updated model scores 77.1% on ARC-AGI-2, more than double the reasoning performance of its predecessor, Gemini 3 Pro.
>
> Check out the visible improvement in this side-by-side comparison,…
>
> — Jeff Dean (@JeffDean)
This release follows last week’s Gemini 3 Deep Think update, which was tailored for scientific and engineering applications. Google mentioned that the new model builds upon that work while providing broader access to developers and enterprise users.
Gemini 3.1 Pro: Expanded Context Window and Output Capacity
Gemini 3.1 Pro supports an input context window of one million tokens. This enables users to incorporate entire code repositories, research datasets, or extensive documents within a single request. Google noted that the model can maintain stable reasoning across different files and data segments, even when the content spans hundreds of thousands of tokens.
Additionally, the model features a 65,000-token output window. This facilitates long-form content generation, such as technical manuals, structured reports, or multi-file code outputs. Google stated that this expanded output capacity reduces task fragmentation, as large outputs can be completed in a single response.
The company highlighted that these enhancements are designed to support developers building autonomous agents, which often require the ability to read large collections of files, navigate complex directory structures, or generate lengthy technical results.
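To make the long-context workflow concrete, here is a minimal sketch of packing a small repository into a single request. It assumes the google-genai Python SDK and uses `gemini-3.1-pro-preview` as a placeholder model ID; neither the SDK wiring nor that identifier is specified in the announcement.

```python
# Minimal sketch: one long-context request over a multi-file codebase.
# Assumptions: the google-genai Python SDK (`pip install google-genai`),
# a GEMINI_API_KEY environment variable, and the placeholder model ID
# "gemini-3.1-pro-preview".
from pathlib import Path

from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Pack every Python file in a local checkout into a single prompt.
repo = Path("./my-project")  # hypothetical repository path
sources = [
    f"# FILE: {path}\n{path.read_text(encoding='utf-8')}"
    for path in sorted(repo.rglob("*.py"))
]

prompt = (
    "Review the repository below and produce a structured technical report "
    "covering architecture, known risks, and suggested refactors.\n\n"
    + "\n\n".join(sources)
)

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # placeholder, not a confirmed ID
    contents=prompt,
)
print(response.text)
```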
Improved Benchmarks Across Logic, Coding, and Science
Google reported performance improvements across several internal and external benchmarks. The model achieved 94.1% on GPQA Diamond, a test of scientific reasoning, and 92.6% on MMMLU, a multilingual language-understanding benchmark. It also posted strong results on coding assessments, including SWE-bench Verified and LiveCodeBench Pro.
The company attributed these gains to refinements in how the model allocates reasoning tokens. This structural adjustment is intended to minimize errors during long-horizon tasks and produce more consistent outputs across dependent steps.
Google indicated that the model is capable of handling scientific workflows that require grounded reasoning or calculations. It can also assist engineering teams needing robust code generation and complex debugging capabilities.
New Tools and Updated Agent Workflows
With this release, Google has introduced a specialized endpoint named `gemini-3.1-pro-preview-customtools`. This endpoint is optimized for developers utilizing file system navigation, code search, and structured tool calls. The model is fine-tuned to prioritize local tools, thereby reducing the likelihood of unnecessary external searches.
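A hedged sketch of what that looks like with standard Gemini function calling, again via the google-genai Python SDK: the endpoint name comes from Google's announcement, but the `read_file` tool declaration below is purely illustrative, and nothing here beyond the model ID is specific to the custom-tools tuning.

```python
# Sketch: pointing the custom-tools endpoint at a local file-read tool.
# The tool declaration is hypothetical; only the model ID comes from the
# announcement. Assumes the google-genai Python SDK.
from google import genai
from google.genai import types

read_file = types.FunctionDeclaration(
    name="read_file",  # hypothetical local tool
    description="Read a file from the local workspace and return its text.",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={"path": types.Schema(type=types.Type.STRING)},
        required=["path"],
    ),
)

client = genai.Client()
response = client.models.generate_content(
    model="gemini-3.1-pro-preview-customtools",
    contents="Find where the config loader validates its schema.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(function_declarations=[read_file])],
    ),
)

# Inspect any tool calls the model chose to make instead of answering directly.
for part in response.candidates[0].content.parts:
    if part.function_call:
        print(part.function_call.name, dict(part.function_call.args))
```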
The update also integrates with Google Antigravity, the company’s platform for agent development. Developers can set a “medium” thinking level for tasks that require a balance between depth and latency. Google explained that this option helps teams manage reasoning budgets while maintaining accuracy.
The Interactions API has also undergone a change: the field `total_reasoning_tokens` has been renamed to `total_thought_tokens`. Google stated that this modification supports thought signatures, which preserve reasoning context for multi-turn workflows.
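For client code that parses usage metadata directly, a small compatibility shim covers the transition. The payload shape below is illustrative only; the announcement names the field but not its surrounding structure.

```python
# Sketch: tolerate both the old and new usage field names during migration.
# The surrounding payload structure is an assumption; only the field names
# come from the announcement.
def thought_tokens(usage: dict) -> int:
    """Return the reasoning/thought token count, preferring the new name."""
    if "total_thought_tokens" in usage:
        return usage["total_thought_tokens"]
    return usage.get("total_reasoning_tokens", 0)

print(thought_tokens({"total_thought_tokens": 1834}))    # new responses -> 1834
print(thought_tokens({"total_reasoning_tokens": 1834}))  # older responses -> 1834
```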
Pricing, Access, and Deployment Across Google Products
The pricing for Gemini 3.1 Pro Preview remains consistent with the previous model. Input tokens are priced at $2 per million for prompts under 200,000 tokens and $4 per million for larger prompts. Output tokens are priced at $12 per million for shorter prompts and $18 per million for longer prompts. Context caching continues to be available for workloads requiring repeated calls.
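As a rough illustration of how the two-tier pricing plays out, the sketch below computes a per-request estimate from those published rates; the token counts are hypothetical and context-caching discounts are not modeled.

```python
# Sketch: per-request cost estimate from the published preview rates.
# The 200,000-token threshold and per-million rates come from the
# announcement; token counts are made up, caching is ignored.
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    long_prompt = input_tokens > 200_000
    input_rate = 4.0 if long_prompt else 2.0     # $ per 1M input tokens
    output_rate = 18.0 if long_prompt else 12.0  # $ per 1M output tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A 150k-token prompt with a 20k-token response lands in the lower tier.
print(f"${estimate_cost(150_000, 20_000):.2f}")   # $0.54

# A 400k-token prompt with the same response size uses the higher rates.
print(f"${estimate_cost(400_000, 20_000):.2f}")   # $1.96
```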
The model is accessible via the Gemini API, Google AI Studio, Android Studio, and the Gemini CLI. Enterprise users can access the model through Vertex AI and Gemini Enterprise. Consumers can utilize the model within the Gemini app and NotebookLM, with increased limits for paid subscribers.
Google indicated that the preview period will enable the company to refine model behavior and safety features prior to general availability. The company added that Gemini 3.1 Pro is positioned as a foundational element for agentic AI systems that need to reason through extensive tasks and operate within complex environments.