在上一篇技术分析中，我们探讨了Browser-use框架如何实现页面元素标注。本文将聚焦于其提示词构造流程，揭示AI如何理解浏览器界面的核心机制。

上一篇-揭秘AI自动化框架Browser-use(一),如何实现炫酷的页面元素标注效果

提示词系统概览

Browser-use的提示词系统由多个协同工作的组件构成：

SystemPrompt - 负责系统级指令的加载与定制
AgentMessagePrompt - 构造页面状态的提示词
PlannerPrompt - 为规划器提供指导的提示词
MessageManager - 管理消息流和上下文

graph TD
A[提示词系统] --> B[SystemPrompt]
A --> C[AgentMessagePrompt]
A --> D[PlannerPrompt]
B --> E[系统行为定义]
C --> F[状态描述]
D --> G[任务规划]

这些组件在Agent类的执行过程中共同作用，确保大模型获得准确且结构化的输入信息。

系统提示词的构造与定制

系统提示词是大模型的核心指令集，定义了模型的行为边界和能力。Browser-use在Agent类初始化时通过SystemPrompt类构造系统提示词：

self._message_manager = MessageManager(

    task=task,

    system_message=SystemPrompt(

        action_description=self.available_actions,

        max_actions_per_step=self.settings.max_actions_per_step,

        override_system_message=override_system_message,

        extend_system_message=extend_system_message,

    ).get_system_message(),

    settings=MessageManagerSettings(

        max_input_tokens=self.settings.max_input_tokens,

        include_attributes=self.settings.include_attributes,

        message_context=self.settings.message_context,

        sensitive_data=sensitive_data,

        available_file_paths=self.settings.available_file_paths,

    ),

    state=self.state.message_manager_state,

)

SystemPrompt类支持三种模式：

默认模式：从预定义模板加载系统提示词
扩展模式：通过extend_system_message参数扩展默认提示词
覆盖模式：通过override_system_message完全替换默认提示词

系统提示词通常包含以下关键内容：

Agent的角色和任务说明
输入格式规范（URL、标签页、交互元素等）
可用操作的描述和使用方法
输出格式要求和示例

这种设计使开发者能够根据需求灵活定制系统提示词，同时保持核心功能不变。

页面状态提示词构造

为了让大模型理解当前网页状态，Browser-use使用AgentMessagePrompt类构造包含页面完整信息的提示词：

def add_state_message(

    self,

    state: BrowserState,

    result: Optional[List[ActionResult]] = None,

    step_info: Optional[AgentStepInfo] = None,

    use_vision=True,

) -> None:

    """Add browser state as human message"""

    # 处理操作结果和错误信息

    if result:

        for r in result:

            if r.include_in_memory:

                if r.extracted_content:

                    msg = HumanMessage(content='Action result: ' + str(r.extracted_content))

                    self._add_message_with_tokens(msg)

                if r.error:

                    # 获取错误信息的最后一行

                    last_line = r.error.split('\n')[-1]

                    msg = HumanMessage(content='Action error: ' + last_line)

                    self._add_message_with_tokens(msg)

                result = None  # 结果已加入历史，不再重复添加

    # 构造当前页面状态消息

    state_message = AgentMessagePrompt(

        state,

        result,

        include_attributes=self.settings.include_attributes,

        step_info=step_info,

    ).get_user_message(use_vision)

    self._add_message_with_tokens(state_message)

AgentMessagePrompt的get_user_message方法负责构造页面状态提示词，它包含以下关键信息：

URL信息：当前页面的URL
标签页信息：所有可用的标签页
交互元素：页面上可交互的元素树（带索引的扁平化表示）
滚动状态：页面上下方是否还有内容可滚动
步骤信息：当前执行到的步骤和总步骤数
视觉信息：当use_vision=True时，包含网页截图的Base64编码

这种结构化的状态表示帮助大模型全面了解当前页面的状态和可执行的操作。

规划器提示词构造

对于复杂任务，Browser-use实现了规划器功能，通过PlannerPrompt类构造专门的规划提示词：

class PlannerPrompt(SystemPrompt):

    def get_system_message(self) -> SystemMessage:

        return SystemMessage(

            content="""You are a planning agent that helps break down tasks into smaller steps and reason about the current state.

Your role is to:

1. Analyze the current state and history

2. Evaluate progress towards the ultimate goal

3. Identify potential challenges or roadblocks

4. Suggest the next high-level steps to take

Inside your messages, there will be AI messages from different agents with different formats.

Your output format should be always a JSON object with the following fields:

{

    "state_analysis": "Brief analysis of the current state and what has been done so far",

    "progress_evaluation": "Evaluation of progress towards the ultimate goal (as percentage and description)",

    "challenges": "List any potential challenges or roadblocks",

    "next_steps": "List 2-3 concrete next steps to take",

    "reasoning": "Explain your reasoning for the suggested next steps"

}

Ignore the other AI messages output structures.

Keep your responses concise and focused on actionable insights."""

        )

规划器的执行由Agent类中的_run_planner方法实现：

async def _run_planner(self) -> Optional[str]:

    """Run the planner to analyze state and suggest next steps"""

    # 如果未设置规划器LLM，则跳过规划

    if not self.settings.planner_llm:

        return None

    # 创建规划器消息历史（使用完整消息历史，除了第一条系统消息）

    planner_messages = [

        PlannerPrompt(self.controller.registry.get_prompt_description()).get_system_message(),

        *self._message_manager.get_messages()[1:],

    ]

    # 如果规划器不使用视觉信息，则移除截图

    if not self.settings.use_vision_for_planner and self.settings.use_vision:

        last_state_message: HumanMessage = planner_messages[-1]

        # 从最后的状态消息中移除图像

        new_msg = ''

        if isinstance(last_state_message.content, list):

            for msg in last_state_message.content:

                if msg['type'] == 'text':

                    new_msg += msg['text']

                elif msg['type'] == 'image_url':

                    continue

        else:

            new_msg = last_state_message.content

        planner_messages[-1] = HumanMessage(content=new_msg)

    # 根据模型类型转换输入消息格式

    planner_messages = convert_input_messages(planner_messages, self.planner_model_name)

    # 获取规划器输出

    response = await self.settings.planner_llm.ainvoke(planner_messages)

    plan = str(response.content)

    # 特定模型处理（如deepseek-reasoner）

    if self.planner_model_name == 'deepseek-reasoner':

        plan = self._remove_think_tags(plan)

    # 尝试解析JSON并记录

    try:

        plan_json = json.loads(plan)

        logger.info(f'Planning Analysis:\n{json.dumps(plan_json, indent=4)}')

    except json.JSONDecodeError:

        logger.info(f'Planning Analysis:\n{plan}')

    except Exception as e:

        logger.debug(f'Error parsing planning analysis: {e}')

        logger.info(f'Plan: {plan}')

    return plan

规划器输出被添加到消息历史中，为Agent提供高层次的指导：

# 在指定间隔运行规划器

if self.settings.planner_llm and self.state.n_steps % self.settings.planner_interval == 0:

    plan = await self._run_planner()

    # 将计划添加到最后一条状态消息之前

    self._message_manager.add_plan(plan, position=-1)

消息管理与上下文维护

Browser-use使用MessageManager类管理所有消息的流动和上下文：

class MessageManager:

    def __init__(

        self,

        task: str,

        system_message: SystemMessage,

        settings: MessageManagerSettings = MessageManagerSettings(),

        state: MessageManagerState = MessageManagerState(),

    ):

        self.task = task

        self.settings = settings

        self.state = state

        self.system_prompt = system_message

        # 仅当状态为空时初始化消息

        if len(self.state.history.messages) == 0:

            self._init_messages()

消息管理器在初始化时会设置基本消息结构：

def _init_messages(self) -> None:

    """Initialize the message history with system message, context, task, and other initial messages"""

    # 添加系统提示词

    self._add_message_with_tokens(self.system_prompt)

    # 添加上下文（如果有）

    if self.settings.message_context:

        context_message = HumanMessage(content='Context for the task' + self.settings.message_context)

        self._add_message_with_tokens(context_message)

    # 添加任务描述

    task_message = HumanMessage(

        content=f'Your ultimate task is: """{self.task}""". If you achieved your ultimate task, stop everything and use the done action in the next step to complete the task. If not, continue as usual.'

    )

    self._add_message_with_tokens(task_message)

    # 添加敏感数据占位符（如果有）

    if self.settings.sensitive_data:

        info = f'Here are placeholders for sensitve data: {list(self.settings.sensitive_data.keys())}'

        info += 'To use them, write <secret>the placeholder name</secret>'

        info_message = HumanMessage(content=info)

        self._add_message_with_tokens(info_message)

    # 添加输出示例

    placeholder_message = HumanMessage(content='Example output:')

    self._add_message_with_tokens(placeholder_message)

    # 构造工具调用示例

    tool_calls = [

        {

            'name': 'AgentOutput',

            'args': {

                'current_state': {

                    'evaluation_previous_goal': 'Success - I opend the first page',

                    'memory': 'Starting with the new task. I have completed 1/10 steps',

                    'next_goal': 'Click on company a',

                },

                'action': [{'click_element': {'index': 0}}],

            },

            'id': str(self.state.tool_id),

            'type': 'tool_call',

        }

    ]

    # 添加示例工具调用

    example_tool_call = AIMessage(

        content='',

        tool_calls=tool_calls,

    )

    self._add_message_with_tokens(example_tool_call)

    self.add_tool_message(content='Browser started')

    # 添加任务历史标记

    placeholder_message = HumanMessage(content='[Your task history memory starts here]')

    self._add_message_with_tokens(placeholder_message)

    # 添加可用文件路径（如果有）

    if self.settings.available_file_paths:

        filepaths_msg = HumanMessage(content=f'Here are file paths you can use: {self.settings.available_file_paths}')

        self._add_message_with_tokens(filepaths_msg)

MessageManager还实现了令牌计数和截断功能，确保输入不超过模型的上下文窗口限制：

def _add_message_with_tokens(self, message: BaseMessage, position: int | None = None) -> None:

    """Add message to history with token counting"""

    # 计算消息的令牌数

    token_count = self._count_tokens(message)

    # 添加消息到历史

    if position is not None:

        self.state.history.messages.insert(position, message)

        self.state.history.message_tokens.insert(position, token_count)

    else:

        self.state.history.messages.append(message)

        self.state.history.message_tokens.append(token_count)

    # 更新当前令牌总数

    self.state.history.current_tokens += token_count

（关键点）大模型提示词输入与返回输出处理

在Agent类的get_next_action方法中，Browser-use通过不同方式处理模型输出：

@time_execution_async('--get_next_action (agent)')

async def get_next_action(self, input_messages: list[BaseMessage]) -> AgentOutput:

    """Get next action from LLM based on current state"""

    input_messages = self._convert_input_messages(input_messages)

    if self.tool_calling_method == 'raw':

        output = self.llm.invoke(input_messages)

        output.content = self._remove_think_tags(str(output.content))

        try:

            parsed_json = extract_json_from_model_output(output.content)

            parsed = self.AgentOutput(**parsed_json)

        except (ValueError, ValidationError) as e:

            logger.warning(f'Failed to parse model output: {output} {str(e)}')

            raise ValueError('Could not parse response.')

    elif self.tool_calling_method is None:

        structured_llm = self.llm.with_structured_output(self.AgentOutput, include_raw=True)

        response: dict[str, Any] = await structured_llm.ainvoke(input_messages)

        parsed: AgentOutput | None = response['parsed']

    else:

        structured_llm = self.llm.with_structured_output(self.AgentOutput, include_raw=True, method=self.tool_calling_method)

        response: dict[str, Any] = await structured_llm.ainvoke(input_messages)

        parsed: AgentOutput | None = response['parsed']

    if parsed is None:

        raise ValueError('Could not parse response.')

    # 限制每步操作数量

    if len(parsed.action) > self.settings.max_actions_per_step:

        parsed.action = parsed.action[: self.settings.max_actions_per_step]

    return parsed

Browser-use支持三种输出处理方式：

原始模式（raw）：直接解析模型输出的文本
结构化输出（None）：使用LangChain的结构化输出功能
工具调用（function_calling/json_mode）：使用OpenAI等模型的工具调用功能

这种灵活性使Browser-use能够适应不同的LLM接口和输出格式。

提示词验证机制

Browser-use实现了输出验证机制，通过_validate_output方法检查模型输出是否符合预期：

async def _validate_output(self) -> bool:

    """Validate the output of the last action is what the user wanted"""

    system_msg = (

        f'You are a validator of an agent who interacts with a browser. '

        f'Validate if the output of last action is what the user wanted and if the task is completed. '

        f'If the task is unclear defined, you can let it pass. But if something is missing or the image does not show what was requested dont let it pass. '

        f'Try to understand the page and help the model with suggestions like scroll, do x, ... to get the solution right. '

        f'Task to validate: {self.task}. Return a JSON object with 2 keys: is_valid and reason. '

        f'is_valid is a boolean that indicates if the output is correct. '

        f'reason is a string that explains why it is valid or not.'

    )

    # 获取当前浏览器状态

    state = await self.browser_context.get_state()

    content = AgentMessagePrompt(

        state=state,

        result=self.state.last_result,

        include_attributes=self.settings.include_attributes,

    )

    msg = [SystemMessage(content=system_msg), content.get_user_message(self.settings.use_vision)]

    # 定义验证结果模型

    class ValidationResult(BaseModel):

        is_valid: bool

        reason: str

    # 使用结构化输出获取验证结果

    validator = self.llm.with_structured_output(ValidationResult, include_raw=True)

    response: dict[str, Any] = await validator.ainvoke(msg)

    parsed: ValidationResult = response['parsed']

    is_valid = parsed.is_valid

    # 处理验证结果

    if not is_valid:

        logger.info(f' Validator decision: {parsed.reason}')

        msg = f'The output is not yet correct. {parsed.reason}.'

        self.state.last_result = [ActionResult(extracted_content=msg, include_in_memory=True)]

    else:

        logger.info(f' Validator decision: {parsed.reason}')

    return is_valid

这种验证机制增强了Browser-use的可靠性，能够在任务执行过程中自动检测问题并提供纠正建议。

多模型配置与提示词定制

Browser-use支持多种LLM，不同模型可能需要特定的提示词处理：

def _convert_input_messages(self, input_messages: list[BaseMessage]) -> list[BaseMessage]:

    """Convert input messages to the correct format"""

    if self.model_name == 'deepseek-reasoner' or self.model_name.startswith('deepseek-r1'):

        return convert_input_messages(input_messages, self.model_name)

    else:

        return input_messages

对于不支持函数调用的模型，Browser-use会做特殊处理：

def _convert_messages_for_non_function_calling_models(input_messages: list[BaseMessage]) -> list[BaseMessage]:

    """Convert messages for non-function-calling models"""

    output_messages = []

    for message in input_messages:

        if isinstance(message, HumanMessage):

            output_messages.append(message)

        elif isinstance(message, SystemMessage):

            output_messages.append(message)

        elif isinstance(message, ToolMessage):

            output_messages.append(HumanMessage(content=message.content))

        elif isinstance(message, AIMessage):

            # 检查tool_calls是否为有效的JSON对象

            if message.tool_calls:

                tool_calls = json.dumps(message.tool_calls)

                output_messages.append(AIMessage(content=tool_calls))

            else:

                output_messages.append(message)

        else:

            raise ValueError(f'Unknown message type: {type(message)}')

    return output_messages

实践案例：使用不同的模型

Browser-use支持多种LLM提供商，如OpenAI、Anthropic、千问等：

# 使用OpenAI

agent = Agent(

    task="搜索四川的10大景点",

    llm=ChatOpenAI(model="gpt-4o"),

)

# 使用Anthropic Claude

agent = Agent(

    task="搜索四川的10大景点",

    llm=ChatAnthropic(model_name="claude-3-5-sonnet"),

)

# 使用ModelScope的千问模型

agent = Agent(

    task="搜索四川的10大景点",

    llm=ChatOpenAI(

        model='Qwen/Qwen2.5-72B-Instruct',

        api_key='xxx',

        base_url='https://api.modelscope.cn/v1/'

    ),

    browser=browser,

    use_vision=False,

)

最佳实践与优化建议

令牌管理：Browser-use通过max_input_tokens参数控制输入令牌数量，防止超出模型限制。当接近限制时，会自动裁剪历史消息：

if 'Max token limit reached' in error_msg:

    # 减少令牌限制

    self._message_manager.settings.max_input_tokens = self.settings.max_input_tokens - 500

    self._message_manager.cut_messages()

扩展系统提示词：使用extend_system_message比完全覆盖系统提示词更安全：

extend_system_message = """

重要规则：无论任务是什么，始终先打开一个新标签页并首先访问baidu.com。

"""

agent = Agent(

    task="搜索四川的10大景点",

    llm=ChatOpenAI(model='gpt-4'),

    extend_system_message=extend_system_message

)

规划器配置：对于复杂任务，配置规划器和规划间隔可以提高执行效率：

agent = Agent(

    task="搜索并比较四川十大景点的门票价格和游览时间",

    llm=ChatOpenAI(model='gpt-4o'),

    planner_llm=ChatOpenAI(model='gpt-4o'),

    planner_interval=3  # 每3步执行一次规划

)

视觉选择：根据任务需要选择是否使用视觉能力：

agent = Agent(

    task="提取网页文本内容",

    llm=ChatOpenAI(model='gpt-4o'),

    use_vision=True,  # 启用视觉能力

    use_vision_for_planner=False  # 规划器不使用视觉能力

)

总结

通过深入理解Browser-use的提示词构造机制，开发者可以优化自动化应用，实现更复杂的任务，同时保持高可靠性和适应性。提示词工程是Browser-use框架的核心，也是其能够应对各种复杂Web场景的关键所在。

想了解更多技术实现细节和源码解析，欢迎关注我的微信公众号【松哥ai自动化】。每周我都会带来一篇深度技术文章，从源码角度剖析各种实用工具的实现原理。

下一篇我们将深入分析Browser-use如何处理复杂的界面交互操作，包括表单填写、多步骤导航和动态内容处理等高级场景，敬请关注！

附录

（一）系统提示词输出示例

Message Type: SystemMessage

Content: You are an AI agent designed to automate browser tasks. Your goal is to accomplish the ultimate task following the rules.

# Input Format

Task

Previous steps

Current URL

Open Tabs

Interactive Elements

[index]<type>text</type>

- index: Numeric identifier for interaction

- type: HTML element type (button, input, etc.)

- text: Element description

Example:

[33]<button>Submit Form</button>

- Only elements with numeric indexes in [] are interactive

- elements without [] provide only context

（二）用户消息提示词输出示例

Content: Your ultimate task is: """采集四川的10大景点""". If you achieved your ultimate task, stop everything and use the done action in the next step to complete the task. If not, continue as usual.

Message Type: HumanMessage

Content: Example output:

（三）规划器提示词输出示例

Tool Calls: [

  {

    "name": "AgentOutput",

    "args": {

      "current_state": {

        "evaluation_previous_goal": "Success - I opend the first page",

        "memory": "Starting with the new task. I have completed 1/10 steps",

        "next_goal": "Click on company a"

      },

      "action": [

        {

          "click_element": {

            "index": 0

          }

        }

      ]

    },

    "id": "1",

    "type": "tool_call"

  }

]

揭秘AI自动化框架Browser-use(二),如何构造大模型提示词的更多相关文章

【基于Puppeteer前端自动化框架】【二】PO模式，断言（如何更简便逻辑的写测试代码）
一.概要前面介绍了Puppeteer+jest+TypeScript做UI自动化,但是这知识基础的,我们实现自动化要考虑的很多,比如PO模式,比如配置文件,比如断言等等.下面就来一一实现我是怎么用p ...
Web自动化框架搭建之二基于数据驱动应用简单实例~~
整体框架,先划分成细小功能模块~~,从最简单的开始,介绍实现循环百度搜索实例: #coding=utf-8 '''Created on 2014��6��9�� @author: 小鱼'''impo ...
接口自动化框架两大神器-正则提取器和Jsonpath提取器
一接口自动化框架一框架结构二结构说明 - API 用于封装被测系统的接口(用request模块封装的请求方法) - TestCase 将一个或多个接口封装成测试用例,并使用UnitTest管 ...
Python3+Selenium2完整的自动化测试实现之旅（七）：完整的轻量级自动化框架实现
一.前言前面系列Python3+Selenium2自动化系列博文,陆陆续续总结了自动化环境最基础环境的搭建.IE和Chrome浏览器驱动配置.selenium下的webdriver模块提供的元素定位 ...
基于Selenium的Web自动化框架增强篇
在写完上一篇“基于Selenium的Web自动化框架”(http://www.cnblogs.com/AlwinXu/p/5836709.html)之后一直没有时间重新审视该框架,正好趁着给同事分享的 ...
python+request接口自动化框架
python+request接口自动化框架搭建 1.数据准备2.用python获取Excel文件中测试用例数据3.通过requests测试接口4.根据接口返回的code值和Excel对比但本章只讲整 ...
为测试赋能，腾讯WeTest探索手游AI自动化测试之路
作者:周大军/孙大伟, 腾讯后台开发高级工程师商业转载请联系腾讯WeTest获得授权,非商业转载请注明出处. WeTest导读做好自动化测试从来不件容易的事情,更何况是手游的自动化测试,相比传 ...
APP自动化框架LazyAndroid使用手册（2）--元素自动抓取
作者:黄书力概述前面的一篇博文简要介绍了安卓自动化测试框架LazyAndroid的组成结构和基本功能,本文将详细描述此框架中元素自动抓取工具lazy-uiautomaterviewer的使用方法. ...
Android自动化框架介绍
随着Android应用得越来越广,越来越多的公司推出了自己移动应用测试平台.例如,百度的MTC.东软易测云.Testin云测试平台…….由于自己所在项目组就是做终端测试工具的,故抽空了解了下几种常见的 ...
接口自动化框架(Pytest+request+Allure)
前言: 接口自动化是指模拟程序接口层面的自动化,由于接口不易变更,维护成本更小,所以深受各大公司的喜爱. 接口自动化包含2个部分,功能性的接口自动化测试和并发接口自动化测试. 本次文章着重介绍第一种, ...

随机推荐

Windows的MySQL数据库升级（解压包方式）
1.背景描述原来的 MySQL 在安装时,是最新的稳定版本 5.7.33 . 经过一段时间后,在原来的 MySQL 版本中,发现存在漏洞. 因为 MySQL 的官方补丁,需要 Oracle 的 si ...
Shell - 脚本案例
题记部分一.节点状态监控脚本(nodeStatusCheck.sh) [脚本名称]nodeStatusCheck.sh [监控规则]通过ping的方式监控集群节点状态,检查节点是否失联 [实现方式] ...
PowerShell实现读取照片并做灰度处理
Powershell一直是我的学习目标.做一个小例子.PowerShell实现读取照片并做灰度处理.还想要保存这张灰度照片并直接打开查看. 分析需求: [读取照片] 需要借助.net framewo ...
thinkphp6实现仿微信朋友圈,用户可发布图片和文字内容,用户可评论,其他用户可评论文章,也可回复用户评论,多层级评论,无限级评论
功能:仿微信朋友圈,用户可发布图片和文字内容,用户可评论,其他用户可评论文章,也可回复用户评论,多层级评论,无限级评论数据库示例:朋友圈内容表 article表:id content image li ...
C# TCP/IP通信，Socket通信例子
1.服务端建立监听,等待客户端连接 class Program { static void Main(string[] args) { TcpListener listener = new TcpLi ...
sql server 2017 STRING_AGG() 替代方案
SELECT @StuId='"'+STRING_AGG(Id,'","')+'"'FROM( SELECT 'a'+cast(Id as varchar) I ...
C# 生成缩略图方法
private static string CreateThumbnail(string filepath, int tWidth, int tHeight) { if (string.IsNullO ...
VulnHub2018_DeRPnStiNK靶机渗透练习
据说该靶机有四个flag 扫描扫描附近主机arp-scan -l 扫主目录扫端口 nmap -sS -sV -n -T4 -p- 192.168.xx.xx 结果如下 Starting Nmap ...
DELPHI 检测服务器地址是否有效
利用DELPH 的ICMP控件检测服务器地址 function CheckNetServer():Boolean; begin IdIcmpClient1.Host := '192.168.1.230 ...
Hack The Box-Chemistry靶机渗透
通过信息收集访问5000端口,cif历史cve漏洞反弹shell,获取数据库,利用低权限用户登录,监听端口,开放8080端口,aihttp服务漏洞文件包含,获取root密码hash值,ssh指定登录 ...

揭秘AI自动化框架Browser-use(二),如何构造大模型提示词