Meta's Official Prompt Engineering Guide: How to Use Llama 2 More Effectively

February 6, 2024

A report from 機器之心 (Machine Heart)

Editor: 小舟

As large language model (LLM) technology matures, prompt engineering is becoming increasingly important. Several research organizations have published LLM prompt engineering guides, including Microsoft and OpenAI.

Recently, Meta, creator of the open-source Llama model family, released an interactive prompt engineering guide of its own, covering prompt engineering techniques and best practices for Llama 2.

Below is the core content of that guide.

Llama Models

In 2023, Meta released the Llama and Llama 2 models. Smaller models are cheaper to deploy and run, while larger models are more capable.

The Llama 2 family spans three parameter scales: 7B, 13B, and 70B.

Code Llama is a code-focused LLM built on top of Llama 2, also available in several parameter scales (7B, 13B, and 34B) and fine-tuned variants (base, Python, and Instruct).

Deploying LLMs

LLMs can be deployed and accessed in a variety of ways, including:

Self-hosting: run inference on local hardware, for example running Llama 2 on a MacBook Pro via llama.cpp. Advantage: self-hosting is best when you have privacy/security requirements or sufficient GPUs.

Cloud hosting: rely on a cloud provider to deploy an instance hosting a specific model, for example running Llama 2 on cloud providers such as AWS, Azure, or GCP. Advantage: cloud hosting is the best fit for customizing a model and its runtime.

Hosted APIs: call the LLM directly through an API. Many companies offer Llama 2 inference APIs, including AWS Bedrock, Replicate, Anyscale, and Together. Advantage: hosted APIs are the simplest option overall.

Hosted APIs

Hosted APIs usually expose two main endpoints:

1. completion: generates a response to a given prompt.

2. chat_completion: generates the next message in a list of messages, providing more explicit instructions and context for use cases such as chatbots.
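To make the distinction concrete, here is a minimal sketch of the input each endpoint style expects; the variable names are our own illustration, and the shapes are generic rather than any particular provider's exact API:

# completion takes a single prompt string
completion_input = "The typical color of the sky is:"

# chat_completion takes a list of structured messages that carry
# the instructions and conversation history
chat_completion_input = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "The typical color of the sky is:"},
]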

Tokens

LLMs process input and output in chunks called tokens, and each model has its own tokenization scheme. Take the following sentence:

Our destiny is written in the stars.

Llama 2's tokenization of it is ["our", "dest", "iny", "is", "writing", "in", "the", "stars"]. Tokens matter when considering API pricing and internal behavior (e.g., hyperparameters). Each model also has a maximum context length that a prompt cannot exceed: 4096 tokens for Llama 2 and 100K tokens for Code Llama.
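Given these limits, it can be useful to count tokens before sending a prompt. Here is a minimal sketch, assuming the Hugging Face transformers library and access to the gated meta-llama/Llama-2-7b-hf tokenizer (any Llama 2 checkpoint's tokenizer behaves the same way):

from transformers import AutoTokenizer

MAX_CONTEXT = 4096  # Llama 2's maximum context length in tokens

# Loading this tokenizer requires approved access to the gated Llama 2 repo
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "Our destiny is written in the stars."
n_tokens = len(tokenizer.encode(prompt))
print(f"{n_tokens} tokens; fits in context: {n_tokens <= MAX_CONTEXT}")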

Notebook Setup

As an example, we use Replicate to call Llama 2 chat and LangChain to easily set up a chat completion API.

First install the prerequisites:

pip install langchain replicate

from typing import Dict, List
from langchain.llms import Replicate
from langchain.memory import ChatMessageHistory
from langchain.schema.messages import get_buffer_string
import os

# Get a free API key from https://replicate.com/account/api-tokens
os.environ["REPLICATE_API_TOKEN"] = "YOUR_KEY_HERE"

LLAMA2_70B_CHAT = "meta/llama-2-70b-chat:2d19859030ff705a87c746f7e96eea03aefb71f166725aee39692f1476566d48"
LLAMA2_13B_CHAT = "meta/llama-2-13b-chat:f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d"

# We'll default to the smaller 13B model for speed; change to LLAMA2_70B_CHAT
# for more advanced (but slower) generations
DEFAULT_MODEL = LLAMA2_13B_CHAT

def completion(
    prompt: str,
    model: str = DEFAULT_MODEL,
    temperature: float = 0.6,
    top_p: float = 0.9,
) -> str:
    llm = Replicate(
        model=model,
        model_kwargs={"temperature": temperature, "top_p": top_p, "max_new_tokens": 1000},
    )
    return llm(prompt)

def chat_completion(
    messages: List[Dict],
    model = DEFAULT_MODEL,
    temperature: float = 0.6,
    top_p: float = 0.9,
) -> str:
    history = ChatMessageHistory()
    for message in messages:
        if message["role"] == "user":
            history.add_user_message(message["content"])
        elif message["role"] == "assistant":
            history.add_ai_message(message["content"])
        else:
            raise Exception("Unknown role")
    return completion(
        get_buffer_string(
            history.messages,
            human_prefix="USER",
            ai_prefix="ASSISTANT",
        ),
        model,
        temperature,
        top_p,
    )

def assistant(content: str):
    return {"role": "assistant", "content": content}

def user(content: str):
    return {"role": "user", "content": content}

def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):
    print(f'==============\n{prompt}\n==============')
    response = completion(prompt, model)
    print(response, end='\n\n')

Completion API

complete_and_print ("The typical color of the sky is:")

complete_and_print ("which model version are you?")

Chat Completion API

Chat completion models provide additional structure for interacting with an LLM: instead of a single string of text, an array of structured message objects is sent to the model. This message list gives the LLM "background" or "history" to continue from.

Typically, each message contains a role and content:

  • Messages with the system role are used by developers to give core instructions to the LLM.
  • Messages with the user role are typically human-provided messages.
  • Messages with the assistant role are typically generated by the LLM.

response = chat_completion(messages=[
    user("My favorite color is blue."),
    assistant("That's great to hear!"),
    user("What is my favorite color?"),
])

print(response)
# "Sure, I can help you with that! Your favorite color is blue."

LLM Hyperparameters

LLM APIs typically take parameters that influence how creative or deterministic the output is. At each step, the LLM generates a list of candidate tokens and their probabilities. The least likely tokens are "cut" from the list (based on top_p), and a token is then randomly sampled from the remaining candidates (controlled by temperature). In other words: top_p controls the breadth of vocabulary in a generation, and temperature controls its randomness; a temperature of 0 produces an almost deterministic result.
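To make those mechanics concrete, here is a toy sketch of the two-step procedure over a hypothetical probability table; sample_token and token_probs are our own illustration, and real decoders operate on logits over the full vocabulary:

import random

def sample_token(token_probs: dict, temperature: float, top_p: float) -> str:
    # Step 1 (top_p): keep the smallest set of most-likely tokens whose
    # cumulative probability reaches top_p; cut the rest.
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Step 2 (temperature): re-weight the survivors; low values sharpen the
    # distribution toward the top token, high values flatten it.
    if temperature < 1e-3:  # near-zero temperature: effectively greedy
        return kept[0][0]
    weights = [p ** (1.0 / temperature) for _, p in kept]
    total = sum(weights)
    return random.choices([t for t, _ in kept], weights=[w / total for w in weights])[0]

probs = {"blue": 0.6, "gray": 0.25, "green": 0.1, "plaid": 0.05}
print(sample_token(probs, temperature=0.0, top_p=0.9))  # always "blue"
print(sample_token(probs, temperature=1.0, top_p=0.9))  # "blue", "gray", or "green"; "plaid" was cut

Raising probabilities to the power 1/temperature is equivalent to dividing logits by the temperature before a softmax. The guide's own demonstration below tunes these parameters on a real model: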

def print_tuned_completion(temperature: float, top_p: float):
    response = completion("Write a haiku about llamas", temperature=temperature, top_p=top_p)
    print(f'[temperature: {temperature} | top_p: {top_p}]\n{response.strip()}\n')

print_tuned_completion(0.01, 0.01)
print_tuned_completion(0.01, 0.01)
# These two generations are highly likely to be the same

print_tuned_completion(1.0, 1.0)
print_tuned_completion(1.0, 1.0)
# These two generations are highly likely to be different

Prompting Techniques

Detailed, explicit instructions produce better results than open-ended prompts:

complete_and_print(prompt="Describe quantum physics in one short sentence of no more than 12 words")
# Returns a succinct explanation of quantum physics that mentions particles and states existing simultaneously.

We can give explicit instructions in the form of rules and restrictions on how the model should respond:

  • Stylization, e.g.:
      • Explain this to me like a children's educational TV show teaching elementary students;
      • I'm a software engineer using large language models for summarization. Summarize the following text in 250 words;
      • Give your answer like a private investigator hunting down a case step by step.
  • Formatting, e.g.:
      • Use bullet points;
      • Return as a JSON object;
      • Use fewer technical terms so it works in everyday communication.
  • Restrictions, e.g.:
      • Only use academic papers;
      • Never give sources older than 2020;
      • If you don't know the answer, say that you don't know.

Here is an example of giving explicit instructions:

complete_and_print ("Explain the latest advances in large language models to me.")

# More likely to cite sources from 2017

complete_and_print ("Explain the latest advances in large language models to me. Always cite your sources. Never cite sources older than 2020.")

# Gives more specific advances and only cites sources from 2020

Zero-Shot Prompting

Some large language models (such as Llama 2) can follow instructions and produce a response without having seen examples of a task beforehand. Prompting without examples is called "zero-shot prompting". For example:

complete_and_print ("Text: This was the best movie I've ever seen! \n The sentiment of the text is:")

# Returns positive sentiment

complete_and_print ("Text: The director was trying too hard. \n The sentiment of the text is:")

# Returns negative sentiment

Few-Shot Prompting

Adding concrete examples of the desired output usually yields more accurate and consistent output. This approach is called "few-shot prompting". For example:

def sentiment(text):
    response = chat_completion(messages=[
        user("You are a sentiment classifier. For each message, give the percentage of positive/neutral/negative."),
        user("I liked it"),
        assistant("70% positive 30% neutral 0% negative"),
        user("It could be better"),
        assistant("0% positive 50% neutral 50% negative"),
        user("It's fine"),
        assistant("25% positive 50% neutral 25% negative"),
        user(text),
    ])
    return response

def print_sentiment(text):
    print(f'INPUT: {text}')
    print(sentiment(text))

print_sentiment("I thought it was okay")
# More likely to return a balanced mix of positive, neutral, and negative

print_sentiment("I loved it!")
# More likely to return 100% positive

print_sentiment("Terrible service 0/10")
# More likely to return 100% negative

Role Prompting

Llama 2 often gives more consistent responses when assigned a role; the role gives the LLM context for the type of answer that is wanted.

For example, to get a more focused, technical response from Llama 2 about the pros and cons of using PyTorch:

complete_and_print ("Explain the pros and cons of using PyTorch.")

# More likely to explain the pros and cons of PyTorch covers general areas like documentation, the PyTorch community, and mentions a steep learning curve

complete_and_print ("Your role is a machine learning expert who gives highly technical advice to senior engineers who work with complicated datasets. Explain the pros and cons of using PyTorch.")

# Often results in more technical benefits and drawbacks that provide more technical details on how model layers

Chain-of-Thought

Simply adding a phrase that encourages step-by-step thinking can significantly improve a large language model's ability to perform complex reasoning (Wei et al. (2022)). This approach is called CoT, or chain-of-thought prompting:

complete_and_print ("Who lived longer Elvis Presley or Mozart?")

# Often gives incorrect answer of "Mozart"

complete_and_print ("Who lived longer Elvis Presley or Mozart? Let's think through this carefully, step by step.")

# Gives the correct answer "Elvis"

Self-Consistency

LLMs are probabilistic, so even with chain-of-thought, a single generation can produce an incorrect result. Self-consistency improves accuracy by selecting the most frequent answer from multiple generations (at the cost of more computation):

import re
from statistics import mode

def gen_answer():
    response = completion(
        "John found that the average of 15 numbers is 40. "
        "If 10 is added to each number then the mean of the numbers is? "
        "Report the answer surrounded by three backticks, for example: ```123```",
        model=LLAMA2_70B_CHAT,
    )
    match = re.search(r'```(\d+)```', response)
    if match is None:
        return None
    return match.group(1)

answers = [gen_answer() for i in range(5)]

print(
    f"Answers: {answers}\n",
    f"Final answer: {mode(answers)}",
)

# Sample runs of Llama-2-70B (final answer correct in all cases):
# [50, 50, 750, 50, 50] -> 50
# [130, 10, 750, 50, 50] -> 50
# [50, None, 10, 50, 50] -> 50

Retrieval-Augmented Generation

Sometimes we want to use factual knowledge in an application. Common facts can be extracted from today's large models out of the box (i.e., using only the model weights):

complete_and_print ("What is the capital of the California?", model = LLAMA2_70B_CHAT)

# Gives the correct answer "Sacramento"

However, LLMs often fail to reliably retrieve more specific facts or private information; the model will either declare that it doesn't know, or hallucinate an incorrect answer:

complete_and_print ("What was the temperature in Menlo Park on December 12th, 2023?")

# "I'm just an AI, I don't have access to real-time weather data or historical weather records."

complete_and_print ("What time is my dinner reservation on Saturday and what should I wear?")

# "I'm not able to access your personal information [..] I can provide some general guidance"

Retrieval-augmented generation (RAG) means including information retrieved from an external database in the prompt (Lewis et al. (2020)). RAG is an effective way to incorporate facts into an LLM application and is more affordable than fine-tuning, which can be costly and can negatively affect the base model's capabilities.

MENLO_PARK_TEMPS = {
    "2023-12-11": "52 degrees Fahrenheit",
    "2023-12-12": "51 degrees Fahrenheit",
    "2023-12-13": "51 degrees Fahrenheit",
}

def prompt_with_rag(retrieved_info, question):
    complete_and_print(
        f"Given the following information: '{retrieved_info}', respond to: '{question}'"
    )

def ask_for_temperature(day):
    temp_on_day = MENLO_PARK_TEMPS.get(day) or "unknown temperature"
    prompt_with_rag(
        f"The temperature in Menlo Park was {temp_on_day} on {day}",  # Retrieved fact
        f"What is the temperature in Menlo Park on {day}?",  # User question
    )

ask_for_temperature("2023-12-12")
# "Sure! The temperature in Menlo Park on 2023-12-12 was 51 degrees Fahrenheit."

ask_for_temperature("2023-07-18")
# "I'm not able to provide the temperature in Menlo Park on 2023-07-18 as the information provided states that the temperature was unknown."

Program-Aided Language Models

LLMs are inherently poor at performing calculations. For example:

complete_and_print ("""

Calculate the answer to the following math problem:

((-5 93 * 4 - 0) * (4^4 -7 0 * 5))

""")

# Gives incorrect answers like 92448, 92648, 95463

Gao et al. (2022) proposed the concept of "program-aided language models" (PAL). While LLMs are bad at arithmetic, they are excellent at code generation. PAL exploits this by instructing the LLM to write code to solve calculation tasks:

complete_and_print(
    """
    # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
    """,
    model="meta/codellama-34b:67942fd0f55b66da802218a19a8f0e1d73095473674061a6ea19f2dc8c053152",
)

# The following code was generated by Code Llama 34B:

num1 = (-5 + 93 * 4 - 0)
num2 = (4**4 + -7 + 0 * 5)
answer = num1 * num2
print(answer)
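The remaining PAL step, which the guide leaves implicit, is to run the generated code to obtain the answer. Here is a minimal sketch, where generated_code is a hypothetical variable standing in for the Code Llama output above; in a real application you would sandbox model-written code rather than passing it to exec() directly:

# The Code Llama output captured as a string (hypothetical variable name)
generated_code = """
num1 = (-5 + 93 * 4 - 0)
num2 = (4**4 + -7 + 0 * 5)
answer = num1 * num2
print(answer)
"""

# WARNING: exec() runs arbitrary code; only do this in a sandboxed environment
exec(generated_code)  # prints 91383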

Original link: https://github.com/facebookresearch/llama-recipes/blob/main/examples/Prompt_Engineering_with_Llama_2.ipynb?utm_source=twitter&utm_medium=organic_social&utm_campaign=llama&utm_content=video