GPT-4.1 is a significantly improved model over its predecessor GPT-4o, with major gains in coding, instruction following, and long context handling. This guide, based on OpenAI's internal testing, systematically covers how to write prompts that fully leverage GPT-4.1's capabilities.
Below is a complete summary organized by topic and sequence. Key quotes are translated into natural English with full context. Keywords and important concepts are highlighted with emoji explanations! 😊
1. GPT-4.1 Characteristics and Basic Prompting Principles
- GPT-4.1 follows instructions more accurately and literally than previous models.
- Clear and specific prompts matter.
- Prompt migration may be necessary. Existing prompts may not work as well — adjust them for the new model.
- If the model behaves differently than expected, explicitly state the desired behavior in one sentence — this almost always steers the model correctly.
"If the model's behavior doesn't match your expectations, clearly state exactly what you want in a single sentence. Almost always, that one sentence alone is enough to guide the model in the right direction."
- AI engineering is experimental.
- Build evaluation metrics (evals) and iterate frequently.
2. Agentic Workflow
2-1. What Is an Agentic Workflow?
- GPT-4.1 is highly suited for agentic problem-solving.
- It has learned a wide range of problem-solving paths and achieved the best performance among non-reasoning models (55% solve rate) on SWE-bench Verified.
2-2. Three Essential Reminders for Your System Prompt
-
Persistence
- Encourage the model to keep going until the user's request is fully resolved.
-
"You are an agent. Continue until the user's request is completely resolved. Only end your turn when you are confident the problem is fully solved."
-
Tool-calling
- Actively encourage tool use to reduce guessing and hallucinated answers.
-
"If you are unsure about file contents or codebase structure relevant to the user's request, use tools to read files and gather the necessary information. Never guess or make things up."
-
Planning (optional)
- Prompt the model to plan before each tool call and reflect on results afterward.
-
"Before each function call, be sure to plan thoroughly. After each function call, reflect carefully on the result. Do not just chain function calls back to back — doing so reduces your problem-solving ability and insight."
- Adding these three reminders to your prompt transforms the model from a passive chatbot into an autonomous, proactive agent.
2-3. Tool Definition and Usage Tips
- Pass tools via the API's
toolsfield. - Write tool names and descriptions clearly. For complex tools, add examples in a
# Examplessection of the system prompt. -
"Put tool usage examples in the
# Examplessection of the system prompt, not in the description field. Keep descriptions concise!"
3. Prompt-Based Planning and Chain-of-Thought
-
GPT-4.1 does not reason internally (Chain-of-Thought), but you can prompt it to plan step by step.
-
"When you explicitly prompt for planning, the model 'thinks out loud' and solves problems step by step."
-
In SWE-bench Verified experiments, explicit planning prompts increased the pass rate by 4%.
4. SWE-bench Verified Example Prompt
Below is part of the agent prompt that achieved the highest score. It includes problem-solving strategy, workflow, and detailed instructions for each step.
"Keep iterating until the problem is completely solved. Everything you need is in the /testbed folder — no internet access required. Only return to me after you have fully and autonomously resolved the issue."
"Plan thoroughly at each step and reflect carefully on the results of previous function calls. Do not just chain function calls back to back."
Workflow Summary
- Deeply understand the problem
- "Read the issue carefully and think through the solution thoroughly before writing any code."
- Investigate the codebase
- Develop a detailed plan
- Make code changes
- Debug
- Test
- Final verification
- Final reflection and additional testing
"Insufficient testing is the most common failure mode for this type of task. Cover all edge cases, and if tests are provided, always run them."
5. Long Context Usage
- Supports up to 1M tokens of input!
- Useful for structured document parsing, reordering, relevant information selection, and multi-hop reasoning.
5-1. Optimal Context Size
- Demonstrates good "needle-in-a-haystack" performance even at 1M tokens.
- However, performance may degrade when complex reasoning or many items must be handled simultaneously.
5-2. Controlling Context Dependency
- Clearly specify whether to use only external documents or also allow the model's internal knowledge.
"Answer using only the provided external documents. If the information is not in the documents, respond with: 'I don't have the information needed to answer.'"
"Primarily use the external documents, but if additional knowledge is needed and you are confident, you may also draw on the model's knowledge."
5-3. Instruction Placement in Prompts
- Placing instructions at both the beginning and end of the context is most effective.
- If placing only once, putting them at the top (beginning) of the context is better.
6. Chain of Thought (Step-by-Step Reasoning)
-
Prompting the model to think "step by step" helps it decompose and solve problems more effectively.
-
"First, carefully think step by step about which documents are needed. Then, output the title and ID of each document, and compile the IDs into a list."
-
Analyze failure cases and add clearer step-by-step instructions accordingly.
Example Reasoning Strategy
- Question Analysis
- "Break down the question and use the provided context to clarify any ambiguous parts."
- Context Analysis
- "Select as many potentially relevant documents as possible. Some may be irrelevant, but the correct answer must be included."
- Analyze and rate relevance for each document (High / Medium / Low / None)
- Synthesis
- "Summarize documents with medium or higher relevance and explain your reasoning."
7. Instruction Following
- GPT-4.1 follows instructions very precisely.
- Existing prompts may not work as well — state your requirements more explicitly.
- Implicit rules are not inferred well — always write them out explicitly.
Recommended Workflow
- Present overall guidelines under a "Response Rules" or "Instructions" section
- Add separate sections for specific behavior changes
- List concrete steps in order when needed
- If problems arise:
- Check for conflicting, unclear, or incorrect instructions/examples
- Add examples
- Use capitalization or reward signals only when truly necessary
"When instructions conflict, GPT-4.1 tends to follow instructions closer to the end of the prompt."
8. Common Failure Patterns
- Requiring tool calls unconditionally can cause empty calls or loops when information is missing.
-
"If the information needed to call the tool is missing, ask the user for it before calling."
-
- Repeating templated phrases can make responses feel monotonous to users.
- Without clear instructions, unnecessary explanations or excessive formatting may appear.
9. Example Prompt: Customer Service
- A best-practice example with varied rules, specific instructions, examples, and output format.
"Hello, this is NewTelco customer support. How can I help you? 😊🎉 You'd like to know about international service costs. 🇫🇷 Let me check the latest information for you — just a moment. 🕑"
- Send informational messages to users before and after tool calls.
- Politely decline prohibited topics.
- Avoid repetitive phrases and add emoji for a friendly touch.
10. Prompt Structure and Delimiter Usage
10-1. Example Prompt Structure
# Role and Objective
# Instructions
## Detailed Instructions
# Reasoning Steps
# Output Format
# Examples
## Example 1
# Context
# Final Step-by-Step Reasoning Instruction
- Add or remove sections as needed, and experiment to find the optimal structure.
10-2. Delimiter Selection Guide
- Markdown
- Actively use headings, code blocks, lists, etc.
- XML
- Better for structure, metadata, and nesting
<examples> <example1 type="Abbreviate"> <input>San Francisco</input> <output>- SF</output> </example1> </examples> - JSON
- Good in coding contexts, but inefficient for long documents
- For many long documents or files: XML or the format
ID: 1 | TITLE: The Fox | CONTENT: ...is effective.
11. Caveats
- The model may resist very long, repetitive outputs.
-
"Explicitly instruct the model to output all of this information, or break the problem into smaller requests."
-
- Parallel tool calls can go wrong — if issues arise, set
parallel_tool_callsto false.
12. File Diff Generation and Application (For Coding Tasks)
- Generating accurate, well-structured diffs is critical for coding tasks.
- OpenAI has published their recommended diff format, which uses:
- Before/after code with clear delimiters instead of line numbers.
Example: apply_patch Tool Description
"apply_patch is a utility for applying diffs/patches to files.
*** Update File: filepath @@ class/function
- old code
+ new code
(Use before/after code with 3-line context and @@ for position — no line numbers)"
Other Effective Diff Formats
- SEARCH/REPLACE
path/to/file.py >>>>>>> SEARCH def search(): pass ======= def search(): raise NotImplementedError() <<<<<<< REPLACE - Pseudo-XML
<edit> <file>path/to/file.py</file> <old_code>def search(): pass</old_code> <new_code>def search(): raise NotImplementedError()</new_code> </edit>
13. Closing Tips and Practical Advice
- Optimize prompts experimentally.
- Actively use clear and specific instructions, step-by-step reasoning prompts, tool use, output formatting, and examples.
- Analyze failure cases and continuously improve your prompts.
- Make the most of GPT-4.1's strengths: long context, diverse delimiters, and tool use!
💡 One-Line Summary
"GPT-4.1 performs at its best with clear, specific prompts, step-by-step reasoning prompts, tool use, and iterative experimentation!"
Feel free to ask any questions! 😊