How I built a sparse mixture of experts (SMoE) model

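This blog post walks through building a sparse mixture of experts language model from scratch. The project is heavily inspired by Andrej Karpathy's "makemore" project and borrows many of its reusable components. Like makemore, makeMoE is an autoregressive, character-level language model, but it uses the so-called sparse mixture of experts architecture.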

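The rest of the post focuses on the key elements of this architecture and how they are implemented. I hope that reading it and running the code in the repository gives you an intuitive understanding of how everything works.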

The GitHub repository contains the end-to-end implementation: https://github.com/AviSoori1x/makeMoE/tree/main

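With the release of Mixtral and the discussion about Llama 3 possibly being a mixture of experts LLM, interest in this architecture keeps growing. Many components of a sparse mixture of experts language model are shared with a traditional Transformer. The architecture looks simple, yet training stability is a major challenge for these models in practice. A small, hackable implementation like this one may help with quickly trying out new approaches.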

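In this implementation, I made a few important changes to the makemore architecture:

A sparse mixture of experts replaces the single feed-forward neural network.
Top-k gating and noisy top-k gating are implemented.
Initialization uses the Kaiming He method here, but the point of this project is that it is easy to swap out, for example to try Xavier (Glorot) initialization.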

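The following, however, is unchanged from makemore:

The dataset, the preprocessing (tokenization), and the language modeling task Andrej originally chose: generating Shakespeare-like text.
The causal self-attention implementation.
The training loop and the inference logic.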

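A sparse mixture of experts language model, as you would expect, relies on self-attention to understand context. Before getting into the details of the mixture of experts module, let's review the basics of self-attention.

The code below shows how self-attention works and the core idea behind it, using the common form known as scaled dot-product self-attention. In this form the query, key, and value matrices all come from the same input sequence. To keep autoregressive generation consistent, particularly in a decoder-only model, the code applies a mask. The mask is essential because it hides everything after the current token, so the model attends only to the earlier part of the sequence. Attention with this mask is called causal self-attention.

Note that sparse mixture of experts models are not restricted to decoder-only Transformers. In fact, much of the influential work in this area, notably by Shazeer and colleagues, was built around the T5 architecture, which contains both the encoder and the decoder of the Transformer.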

# This code is borrowed from Andrej Karpathy's makemore repository, linked in the repo.
# The self-attention layers in sparse mixture of experts models are the same as
# in regular transformer models.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)
# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
# scaled scores: (B, T, 16) @ (B, 16, T) ---> (B, T, T)
wei = q @ k.transpose(-2, -1) * head_size**-0.5
tril = torch.tril(torch.ones(T, T))
# mask out future positions so each token only attends to itself and earlier tokens
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1) # (B, T, T)
v = value(x) # (B, T, H)
out = wei @ v # (B, T, T) @ (B, T, H) -> (B, T, H)
out.shape

torch.Size([4, 8, 16])
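
To make the masking step concrete, here is a minimal sketch in plain Python (no PyTorch needed) of a row-wise softmax with future positions set to -inf. The function name `causal_softmax` and the score values are made up for illustration; only the technique mirrors the code above.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with future positions masked."""
    T = len(scores)
    weights = []
    for t in range(T):
        # positions j > t are in the future: give them a score of -inf
        row = [scores[t][j] if j <= t else float("-inf") for j in range(T)]
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - m) for s in row]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical attention scores for a 3-token sequence
scores = [[0.5, 1.0, -0.3],
          [0.2, 0.2, 2.0],
          [1.0, 0.0, -1.0]]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row is all weight on position 0, and every row still sums to 1 even though the masked entries are exactly zero.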

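Causal self-attention and multi-head causal self-attention are structured as follows. Multi-head self-attention runs several attention heads in parallel, each attending to a different part of the embedding dimension. This lets the heads learn different relationships, and because they are computed in parallel it also keeps training efficient. To reduce overfitting, I use dropout for regularization throughout the implementation.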

#Causal scaled dot product self-Attention Head
n_embed = 64 # spelled n_embed to match the Block and model classes later in the post
n_head = 4
n_layer = 4
head_size = 16
dropout = 0.1
block_size = 32 # maximum context length (assumed value, the post does not show it; use the one from your data pipeline)
class Head(nn.Module):
    """ one head of self-attention """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size) rather than 1/sqrt(C)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

Multi-head self-attention is implemented as follows:

#Multi-Headed Self Attention
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # the concatenated heads have num_heads * head_size channels
        self.proj = nn.Linear(num_heads * head_size, n_embed)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
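
Why scale the attention scores at all? A quick simulation in plain Python, under the assumption that queries and keys have independent unit-variance entries, suggests the reason: the variance of a raw dot product grows with the head size, while the scaled version stays near 1, which keeps the softmax from saturating. The helper `random_vector` and the sample count are illustrative choices of mine.

```python
import random
import statistics

random.seed(0)
head_size = 16  # same head size as in the post

def random_vector(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

raw, scaled = [], []
for _ in range(20000):
    q, k = random_vector(head_size), random_vector(head_size)
    dot = sum(a * b for a, b in zip(q, k))
    raw.append(dot)
    scaled.append(dot * head_size ** -0.5)

raw_var = statistics.pvariance(raw)
scaled_var = statistics.pvariance(scaled)
print(f"variance of raw scores:    {raw_var:.2f}")     # close to head_size
print(f"variance of scaled scores: {scaled_var:.2f}")  # close to 1
```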

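First we create the expert module, which is just a simple multilayer perceptron (MLP). In the sparse mixture of experts (MoE) architecture, the self-attention mechanism inside each Transformer block stays the same, but the block itself changes significantly: the standard feed-forward network is replaced by several sparsely activated feed-forward networks, known as experts.

"Sparsely activated" means that each token in the sequence is routed to only a small number of experts, typically one or two, rather than to all of them. This speeds up training and inference, because only a few experts are activated in each forward pass. All experts still have to be held in GPU memory, however, which makes deployment challenging once the total parameter count reaches hundreds of billions or trillions.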

#Expert module
class Expert(nn.Module):
    """ Each expert is a simple MLP: a linear layer, a non-linearity, a second linear layer, then dropout """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
    def forward(self, x):
        return self.net(x)
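
The gap between stored and active parameters can be worked out by hand. The sketch below counts the parameters of one `Expert` as defined above and compares what must be kept in memory with what a single token actually uses. The values `n_embed = 64`, `num_experts = 8`, and `top_k = 2` are illustrative, and the helper names are my own.

```python
def linear_params(n_in, n_out):
    # weight matrix plus bias vector of an nn.Linear(n_in, n_out)
    return n_in * n_out + n_out

def expert_params(n_embed):
    # Expert = Linear(n_embed, 4*n_embed) -> ReLU -> Linear(4*n_embed, n_embed)
    return linear_params(n_embed, 4 * n_embed) + linear_params(4 * n_embed, n_embed)

n_embed, num_experts, top_k = 64, 8, 2  # illustrative values
per_expert = expert_params(n_embed)
stored = num_experts * per_expert   # must all be kept in memory
active = top_k * per_expert         # used for any single token
print(per_expert, stored, active, active / stored)  # 33088 264704 66176 0.25
```

With these numbers each token touches only a quarter of the expert parameters, while all of them still occupy memory.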

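A simple example helps build intuition for top-k gating:

The gating network, also called the router, decides which expert networks receive each token's output from multi-head attention. Suppose there are 4 experts and each token should be sent to its top 2 experts. First, the tokens pass through a linear layer in the gating network. This layer maps the input tensor from shape (2, 4, 32) to (2, 4, 4): (2, 4, 32) is (batch size, number of tokens, n_embed), where n_embed is the channel dimension of the input, and (2, 4, 4) is (batch size, number of tokens, number of experts). Next, we take the two largest values along the last dimension together with their indices. This is the top-k selection.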

#Understanding how gating works
num_experts = 4
top_k=2
n_embed=32
#Example multi-head attention output for a simple illustrative example, consider n_embed=32, context_length=4 and batch_size=2
mh_output = torch.randn(2, 4, n_embed)
topkgate_linear = nn.Linear(n_embed, num_experts) # nn.Linear(32, 4)
logits = topkgate_linear(mh_output)
top_k_logits, top_k_indices = logits.topk(top_k, dim=-1)  # Get top-k experts
top_k_logits, top_k_indices

#output:
(tensor([[[ 0.0246, -0.0190],
          [ 0.1991,  0.1513],
          [ 0.9749,  0.7185],
          [ 0.4406, -0.8357]],
         [[ 0.6206, -0.0503],
          [ 0.8635,  0.3784],
          [ 0.6828,  0.5972],
          [ 0.4743,  0.3420]]], grad_fn=<TopkBackward0>),
 tensor([[[2, 3],
          [2, 1],
          [3, 1],
          [2, 1]],
         [[0, 2],
          [0, 3],
          [3, 2],
          [3, 0]]]))

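In the sparse gating mechanism, we keep only the top k values along the last dimension, at their original indices. Every other position is filled with negative infinity (-inf), and the result goes through a softmax. The softmax turns the -inf entries into zeros and normalizes the top two values so that they sum to 1. This sum-to-1 property is what allows the gating values to weight the expert outputs.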

# full_like creates a tensor with the same shape as logits, filled here with -inf for masking
zeros = torch.full_like(logits, float('-inf'))
# write the top-k logits back at their original expert indices
sparse_logits = zeros.scatter(-1, top_k_indices, top_k_logits)
sparse_logits

gating_output= F.softmax(sparse_logits, dim=-1)
gating_output

#output
tensor([[[0.0000, 0.0000, 0.5109, 0.4891],
         [0.0000, 0.4881, 0.5119, 0.0000],
         [0.0000, 0.4362, 0.0000, 0.5638],
         [0.0000, 0.2182, 0.7818, 0.0000]],
        [[0.6617, 0.0000, 0.3383, 0.0000],
         [0.6190, 0.0000, 0.0000, 0.3810],
         [0.0000, 0.0000, 0.4786, 0.5214],
         [0.4670, 0.0000, 0.0000, 0.5330]]], grad_fn=<SoftmaxBackward0>)
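
The first row of this output can be checked by hand. The sketch below redoes the mask-then-softmax step in plain Python for the first token, whose top-2 logits in the earlier output were 0.0246 (expert 2) and -0.0190 (expert 3). The two smaller logits are not shown in the post, so -1.0 and -2.0 are made-up stand-ins, and `top_k_gating` is a name I chose for illustration.

```python
import math

def top_k_gating(logits, k):
    """Keep the k largest logits, mask the rest with -inf, then apply softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    sparse = [logits[i] if i in top else float("-inf") for i in range(len(logits))]
    m = max(sparse)
    exps = [math.exp(s - m) for s in sparse]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps], top

# Experts 0 and 1 get made-up logits; experts 2 and 3 use the values printed in the post
logits = [-1.0, -2.0, 0.0246, -0.0190]
weights, chosen = top_k_gating(logits, k=2)
print(chosen)                          # [2, 3]
print([round(w, 4) for w in weights])  # [0.0, 0.0, 0.5109, 0.4891]
```

Because the masked experts contribute exactly zero, the result depends only on the two surviving logits, whatever the discarded values were.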

Next, we generalize and modularize the code above, and then add noisy top-k gating for load balancing.

# First define the top-k router module
class TopkRouter(nn.Module):
    def __init__(self, n_embed, num_experts, top_k):
        super(TopkRouter, self).__init__()
        self.top_k = top_k
        self.linear = nn.Linear(n_embed, num_experts)
    def forward(self, mh_output):
        # mh_output is the output tensor from the multi-head self-attention block
        logits = self.linear(mh_output)
        # keep the top-k logits per token and mask all other experts with -inf
        top_k_logits, indices = logits.topk(self.top_k, dim=-1)
        zeros = torch.full_like(logits, float('-inf'))
        sparse_logits = zeros.scatter(-1, indices, top_k_logits)
        # softmax turns the -inf entries into exact zeros
        router_output = F.softmax(sparse_logits, dim=-1)
        return router_output, indices

Now let's test it with some sample input:

#Testing this out:
num_experts = 4
top_k = 2
n_embd = 32
mh_output = torch.randn(2, 4, n_embd)  # Example input
top_k_gate = TopkRouter(n_embd, num_experts, top_k)
gating_output, indices = top_k_gate(mh_output)
gating_output.shape, gating_output, indices
#And it works!!

#output
(torch.Size([2, 4, 4]),
 tensor([[[0.5284, 0.0000, 0.4716, 0.0000],
          [0.0000, 0.4592, 0.0000, 0.5408],
          [0.0000, 0.3529, 0.0000, 0.6471],
          [0.3948, 0.0000, 0.0000, 0.6052]],
         [[0.0000, 0.5950, 0.4050, 0.0000],
          [0.4456, 0.0000, 0.5544, 0.0000],
          [0.7208, 0.0000, 0.0000, 0.2792],
          [0.0000, 0.0000, 0.5659, 0.4341]]], grad_fn=<SoftmaxBackward0>),
 tensor([[[0, 2],
          [3, 1],
          [3, 1],
          [3, 0]],
         [[1, 2],
          [2, 0],
          [0, 3],
          [2, 3]]]))
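
One way to see how evenly the router spreads work is to count how many routing slots each expert receives. The sketch below does this in plain Python for the `indices` tensor printed above, copied in as nested lists. The counting approach is my own illustration, not part of the original implementation.

```python
from collections import Counter

# top-k expert indices copied from the router output above: 2 sequences x 4 tokens x top_k=2
indices = [[[0, 2], [3, 1], [3, 1], [3, 0]],
           [[1, 2], [2, 0], [0, 3], [2, 3]]]
num_experts = 4

load = Counter(e for seq in indices for token in seq for e in token)
tokens_per_expert = [load[e] for e in range(num_experts)]
print(tokens_per_expert)  # [4, 3, 4, 5]

# fraction of all routing slots each expert receives; 1 / num_experts would be perfectly balanced
total = sum(tokens_per_expert)
print([n / total for n in tokens_per_expert])  # [0.25, 0.1875, 0.25, 0.3125]
```

With a freshly initialized router the load is already fairly even. During training, a router can collapse onto a few experts, and this kind of count is a simple way to watch for it.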

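Although the recently released Mixtral paper does not mention it, I believe noisy top-k gating is an important tool when training MoE models. We do not want every token to be routed to the same small set of experts; instead we want a balance between exploiting and exploring the experts. Adding Gaussian noise to the logits of the gating network helps with load balancing, which in turn makes training more efficient.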

#Changing the above to accommodate noisy top-k gating
class NoisyTopkRouter(nn.Module):
    def __init__(self, n_embed, num_experts, top_k):
        super(NoisyTopkRouter, self).__init__()
        self.top_k = top_k
        #layer for router logits
        self.topkroute_linear = nn.Linear(n_embed, num_experts)
        #layer that learns a per-expert noise scale
        self.noise_linear = nn.Linear(n_embed, num_experts)
    def forward(self, mh_output):
        # mh_output is the output tensor from the multi-head self-attention block
        logits = self.topkroute_linear(mh_output)
        #Noise logits
        noise_logits = self.noise_linear(mh_output)
        #Adding scaled unit gaussian noise to the logits (softplus keeps the scale positive)
        noise = torch.randn_like(logits)*F.softplus(noise_logits)
        noisy_logits = logits + noise
        top_k_logits, indices = noisy_logits.topk(self.top_k, dim=-1)
        zeros = torch.full_like(noisy_logits, float('-inf'))
        sparse_logits = zeros.scatter(-1, indices, top_k_logits)
        router_output = F.softmax(sparse_logits, dim=-1)
        return router_output, indices

Now let's test this implementation again.

#Testing this out, again:
num_experts = 8
top_k = 2
n_embd = 16
mh_output = torch.randn(2, 4, n_embd)  # Example input
noisy_top_k_gate = NoisyTopkRouter(n_embd, num_experts, top_k)
gating_output, indices = noisy_top_k_gate(mh_output)
gating_output.shape, gating_output, indices
#It works!!

#output
(torch.Size([2, 4, 8]),
 tensor([[[0.4181, 0.0000, 0.5819, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.4693, 0.5307, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.0000, 0.4985, 0.5015, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.2641, 0.0000, 0.7359, 0.0000, 0.0000]],
         [[0.0000, 0.0000, 0.0000, 0.6301, 0.0000, 0.3699, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.4766, 0.0000, 0.0000, 0.0000, 0.5234],
          [0.0000, 0.0000, 0.0000, 0.6815, 0.0000, 0.0000, 0.3185, 0.0000],
          [0.4482, 0.5518, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]]],
        grad_fn=<SoftmaxBackward0>),
 tensor([[[2, 0],
          [1, 0],
          [2, 1],
          [5, 3]],
         [[3, 5],
          [7, 3],
          [3, 6],
          [1, 0]]]))
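
To get a feel for what the noise does, the following plain-Python sketch routes one token 1,000 times with the same recipe as `NoisyTopkRouter` (logits plus unit Gaussian noise scaled by a softplus). The logits and noise-layer outputs are made-up numbers, and `top_k_indices` is a helper I introduce for illustration.

```python
import math
import random

def softplus(x):
    return math.log1p(math.exp(x))

def top_k_indices(values, k):
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return tuple(sorted(order[:k]))

rng = random.Random(0)
logits = [2.0, 1.5, 1.4, 0.1]        # made-up router logits for one token
noise_logits = [0.0, 0.0, 0.0, 0.0]  # made-up outputs of the noise layer
k = 2

clean_choice = top_k_indices(logits, k)
noisy_choices = set()
for _ in range(1000):
    # same recipe as NoisyTopkRouter: logits + N(0, 1) * softplus(noise_logits)
    noisy = [l + rng.gauss(0.0, 1.0) * softplus(n) for l, n in zip(logits, noise_logits)]
    noisy_choices.add(top_k_indices(noisy, k))

print(clean_choice)           # (0, 1): without noise the same pair wins every time
print(sorted(noisy_choices))  # with noise, other expert pairs get selected as well
```

Without noise, expert 2 would never be picked for this token even though its logit is almost tied with expert 1. With noise it gets a share of the traffic, which is the exploration effect described above.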

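Building the sparse mixture of experts module mostly comes down to using the output of the gating network. Once we have the gating results, we selectively multiply the outputs of the top k experts by the corresponding top-k gating values. The result is a weighted sum, which is the output of the SparseMoE module. The key challenge is to avoid unnecessary computation: run the forward pass only for the top k experts of each token and then compute the weighted sum. If every expert ran on every token, the model would no longer be sparse and the point of using a sparse MoE would be lost.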

class SparseMoE(nn.Module):
    def __init__(self, n_embed, num_experts, top_k):
        super(SparseMoE, self).__init__()
        self.router = NoisyTopkRouter(n_embed, num_experts, top_k)
        self.experts = nn.ModuleList([Expert(n_embed) for _ in range(num_experts)])
        self.top_k = top_k
    def forward(self, x):
        gating_output, indices = self.router(x)
        final_output = torch.zeros_like(x)
        # Flatten batch and time so tokens can be selected with a 1-D mask
        flat_x = x.view(-1, x.size(-1))
        flat_gating_output = gating_output.view(-1, gating_output.size(-1))
        # Loop over the experts; each one only sees the tokens routed to it
        for i, expert in enumerate(self.experts):
            # Create a mask for the inputs where the current expert is in top-k
            expert_mask = (indices == i).any(dim=-1)
            flat_mask = expert_mask.view(-1)
            if flat_mask.any():
                expert_input = flat_x[flat_mask]
                expert_output = expert(expert_input)
                # Extract and apply gating scores
                gating_scores = flat_gating_output[flat_mask, i].unsqueeze(1)
                weighted_output = expert_output * gating_scores
                # Accumulate rather than overwrite (masked_scatter_ would keep only the
                # last expert's contribution), so all top-k experts are summed per token
                final_output[expert_mask] += weighted_output
        return final_output

A good way to check that the implementation works is to test it with sample input. Running the code below, we can see that it does!

import torch
import torch.nn as nn
import torch.nn.functional as F
#Let's test this out
num_experts = 8
top_k = 2
n_embd = 16
dropout=0.1
mh_output = torch.randn(4, 8, n_embd)  # Example multi-head attention output
sparse_moe = SparseMoE(n_embd, num_experts, top_k)
final_output = sparse_moe(mh_output)
print("Shape of the final output:", final_output.shape)

Shape of the final output: torch.Size([4, 8, 16])

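One point to emphasize: the magnitudes of the top_k values produced by the router / gating network matter as well. The top_k indices determine which experts are activated, and the magnitudes of the values at those indices determine how their outputs are weighted. The figure below illustrates this weighted sum in more detail.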
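
The weighted sum can also be traced with a toy example. In this plain-Python sketch the "experts" are simple functions on a number rather than MLPs on a vector, and the gate values are made up. It follows the same pattern as `SparseMoE.forward`: loop over experts, process only the tokens routed to each, and accumulate the gated outputs.

```python
# Toy "experts": plain functions on a number instead of MLPs on a vector
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x, lambda x: -x]

tokens = [1.0, 2.0, 3.0]
# Hypothetical router output, one row per token: zero weight means "expert not selected"
gates = [[0.6, 0.0, 0.4, 0.0],
         [0.0, 0.7, 0.0, 0.3],
         [0.5, 0.5, 0.0, 0.0]]

output = [0.0] * len(tokens)
calls = 0
for i, expert in enumerate(experts):
    # only the tokens routed to expert i are processed by it
    routed = [t for t in range(len(tokens)) if gates[t][i] > 0.0]
    for t in routed:
        output[t] += gates[t][i] * expert(tokens[t])  # accumulate, do not overwrite
        calls += 1

print(output)  # weighted sum of the selected experts for each token
print(calls)   # 6 expert evaluations instead of 12 for a dense layer
```

For the first token, for example, the result is 0.6 * (1 + 1) + 0.4 * (1 * 1) = 1.6. If the accumulation were replaced by an assignment, only the last selected expert would survive for each token.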

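Multi-head self-attention and the sparse mixture of experts are combined into a sparse mixture of experts Transformer block. Just as in a standard Transformer block, skip connections are added to keep training stable and to prevent problems such as vanishing gradients. Layer normalization is also used to stabilize learning further.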

#Create a self attention + mixture of experts block, that may be repeated several times
class Block(nn.Module):
    """ Mixture of Experts Transformer block: communication followed by computation (multi-head self attention + SparseMoE) """
    def __init__(self, n_embed, n_head, num_experts, top_k):
        # n_embed: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.smoe = SparseMoE(n_embed, num_experts, top_k)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)
    def forward(self, x):
        # pre-norm residual connections around attention and the sparse MoE layer
        x = x + self.sa(self.ln1(x))
        x = x + self.smoe(self.ln2(x))
        return x
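
The two stabilizing ingredients of the block, layer normalization and the skip connection, are easy to sketch for a single token vector in plain Python. This is a simplified illustration: it leaves out the learned scale and shift that `nn.LayerNorm` has, and the vector values and the names `layer_norm` and `residual` are my own.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual(x, sublayer):
    # x + sublayer(LayerNorm(x)): the skip path carries x through unchanged
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [4.0, -2.0, 10.0, 0.0]  # made-up token vector
normed = layer_norm(x)
print([round(v, 3) for v in normed])

# With a sub-layer that outputs zeros, the block reduces to the identity
print(residual(x, lambda h: [0.0] * len(h)))  # [4.0, -2.0, 10.0, 0.0]
```

The identity case shows why skip connections help gradients flow: even if a sub-layer contributes nothing, the input still passes straight through.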

Finally, we put everything together to create a sparse mixture of experts language model.

class SparseMoELanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # token embedding table: maps each token id to an n_embed-dimensional vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(*[Block(n_embed, n_head=n_head, num_experts=num_experts,top_k=top_k) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embed) # final layer norm
        self.lm_head = nn.Linear(n_embed, vocab_size)
    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
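
The `generate` loop follows a pattern that can be shown without a trained network: crop the context, get a distribution over the next token, sample from it, append, repeat. In this plain-Python sketch, `toy_model` is a hypothetical stand-in for the language model that simply favors the token after the last one. Everything else mirrors the steps in `generate`.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(context, vocab_size):
    # Hypothetical stand-in for the language model: favors the token after the last one
    last = context[-1]
    return [2.0 if tok == (last + 1) % vocab_size else 0.0 for tok in range(vocab_size)]

def generate(idx, max_new_tokens, block_size, vocab_size, rng):
    idx = list(idx)
    for _ in range(max_new_tokens):
        context = idx[-block_size:]                      # crop to the last block_size tokens
        probs = softmax(toy_model(context, vocab_size))  # distribution over the next token
        next_tok = rng.choices(range(vocab_size), weights=probs, k=1)[0]  # sample, not argmax
        idx.append(next_tok)                             # feed the sample back in
    return idx

rng = random.Random(1337)
sample = generate([0], max_new_tokens=10, block_size=4, vocab_size=5, rng=rng)
print(sample)
```

Sampling rather than taking the most likely token is what makes each generated passage different from run to run.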

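Initialization is important for training deep neural networks effectively. Kaiming He initialization is used here because the experts use ReLU activations. Feel free to experiment with Glorot initialization instead. Jeremy Howard's fast.ai Part 2 has an excellent lecture that implements these methods from scratch:
https://course.fast.ai/Lessons/lesson17.html
Since the literature commonly uses Glorot initialization for Transformer models, it may be an opportunity to improve model performance.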

def kaiming_init_weights(m):
    # apply Kaiming He (normal) initialization to the weight of every linear layer
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight)
model = SparseMoELanguageModel()
model.apply(kaiming_init_weights)
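
To see how much the two schemes differ in practice, the standard deviations can be computed directly. The sketch below uses the textbook formulas for the normal variants as I understand them (He: sqrt(2 / fan_in), Glorot: sqrt(2 / (fan_in + fan_out))). The layer size is the first linear layer of an expert with n_embed = 64, an illustrative choice, and the function names are my own.

```python
import math

def kaiming_normal_std(fan_in):
    # He initialization for ReLU: std = sqrt(2 / fan_in)
    return math.sqrt(2.0 / fan_in)

def xavier_normal_std(fan_in, fan_out):
    # Glorot initialization: std = sqrt(2 / (fan_in + fan_out))
    return math.sqrt(2.0 / (fan_in + fan_out))

# First layer of an expert with n_embed = 64: nn.Linear(64, 256)
fan_in, fan_out = 64, 256
print(round(kaiming_normal_std(fan_in), 4))          # 0.1768
print(round(xavier_normal_std(fan_in, fan_out), 4))  # 0.0791
```

For this layer the He weights start out a bit more than twice as large as the Glorot ones, so swapping schemes is a real change to the starting point of training, not a cosmetic one.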

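I used MLflow to track and log the important metrics and hyperparameters during training. The training loop shown here includes that code. If you would rather not use MLflow, the notebooks in the makeMoE GitHub repository also contain code without it. I find tracking parameters and metrics with MLflow very convenient, especially when experimenting.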

#Using MLflow
import mlflow
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
#mlflow.set_experiment("makeMoE")
with mlflow.start_run():
    #If you use mlflow.autolog() this will be automatically logged. I chose to explicitly log here for completeness
    params = {"batch_size": batch_size , "block_size" : block_size, "max_iters": max_iters, "eval_interval": eval_interval,
              "learning_rate": learning_rate, "device": device, "eval_iters": eval_iters, "dropout" : dropout, "num_experts": num_experts, "top_k": top_k }
    mlflow.log_params(params)
    for iter in range(max_iters):
        # every once in a while evaluate the loss on train and val sets
        if iter % eval_interval == 0 or iter == max_iters - 1:
            losses = estimate_loss()
            print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
            metrics = {"train_loss": losses['train'], "val_loss": losses['val']}
            mlflow.log_metrics(metrics, step=iter)
        # sample a batch of data
        xb, yb = get_batch('train')
        # evaluate the loss
        logits, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

8.996545 M parameters
step 0: train loss 5.3223, val loss 5.3166
step 100: train loss 2.7351, val loss 2.7429
step 200: train loss 2.5125, val loss 2.5233
.
.
.
step 4999: train loss 1.5712, val loss 1.7508

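Logging the training and validation losses gives a clearer picture of how training is going. The plot suggests that I should have stopped training at around step 4,500, when the validation loss ticked up slightly.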
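
Stopping at the right point can be automated with a simple early-stopping rule: remember the best validation loss and stop once it has failed to improve for a set number of evaluations. The sketch below is one possible version in plain Python. The function name, the `patience` parameter, and the loss values are all made up for illustration and are not the numbers from this run.

```python
def early_stopping_step(val_losses, patience=3):
    """Return (step, loss) of the best evaluation, stopping once `patience`
    consecutive evaluations fail to improve on it."""
    best_step, best_loss, bad_evals = None, float("inf"), 0
    for step, loss in val_losses:
        if loss < best_loss:
            best_step, best_loss, bad_evals = step, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break
    return best_step, best_loss

# Made-up validation losses for illustration, not the values from the run above
history = [(4200, 1.760), (4300, 1.752), (4400, 1.748), (4500, 1.745),
           (4600, 1.749), (4700, 1.751), (4800, 1.750), (4900, 1.752)]
print(early_stopping_step(history))  # (4500, 1.745)
```

In a real training loop you would also save a checkpoint whenever the best loss improves, so the model from the best step can be restored.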

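We can now use the model to generate text autoregressively, one character at a time. For a sparsely activated model with about 9 million parameters, the results are quite decent.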

# generate from the model. Not great. Not too bad either
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

DUKE VINCENVENTIO:
If it ever fecond he town sue kigh now,
That thou wold'st is steen 't.
SIMNA:
Angent her; no, my a born Yorthort,
Romeoos soun and lawf to your sawe with ch a woft ttastly defy,
To declay the soul art; and meart smad.
CORPIOLLANUS:
Which I cannot shall do from by born und ot cold warrike,
What king we best anone wrave's going of heard and good
Thus playvage; you have wold the grace.
...

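I hope this explanation helps you understand the architecture of sparse mixture of experts models and how the pieces fit together.

I relied mainly on the following papers for this implementation: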

https://arxiv.org/pdf/2401.04088.pdf
https://arxiv.org/pdf/1701.06538.pdf

Andrej Karpathy's original makemore implementation:

https://github.com/karpathy/makemore

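All of the code was developed on Databricks using a single A100 GPU. If you run it on Databricks, you can scale it out to an arbitrarily large GPU cluster on the cloud provider of your choice. I chose MLflow (preinstalled in Databricks, and pip-installable anywhere else) because it makes it easy to track and log all the necessary metrics. Using MLflow is entirely optional.

Please note that this implementation emphasizes readability and hackability over performance, so there is plenty of room for improvement.

With that in mind, here are a few things you could try: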

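Make the mixture of experts module more efficient. I believe the sparse activation of the correct experts can be improved significantly.
Try different neural network initialization strategies. The fast.ai Part 2 course mentioned above is a good resource.
Move from character-level to subword tokenization.
Run a Bayesian hyperparameter search over the number of experts and top_k (the number of experts activated per token). This can be classed as neural architecture search.
Expert capacity is not discussed or implemented here, but it is well worth exploring.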
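
As a starting point for the last item, here is a rough plain-Python sketch of one common way expert capacity is handled: cap the number of tokens each expert may take per batch and drop the overflow. This is not part of makeMoE. The capacity formula (even share times a capacity factor), the 1.25 factor, the function names, and the routing data are all assumptions for illustration.

```python
import math

def expert_capacity(num_tokens, num_experts, top_k, capacity_factor=1.25):
    # tokens each expert would get under perfectly even routing, with some slack
    return math.ceil(num_tokens * top_k / num_experts * capacity_factor)

def apply_capacity(indices, num_experts, capacity):
    """indices: one list of chosen experts per token. Returns kept and dropped (token, expert) pairs."""
    load = [0] * num_experts
    kept, dropped = [], []
    for token, chosen in enumerate(indices):
        for e in chosen:
            if load[e] < capacity:
                load[e] += 1
                kept.append((token, e))
            else:
                dropped.append((token, e))  # this expert is full for the current batch
    return kept, dropped

# Made-up, deliberately unbalanced routing: 8 tokens, 4 experts, top_k = 2
indices = [[0, 1], [0, 1], [0, 2], [0, 1], [0, 3], [0, 1], [0, 2], [0, 1]]
capacity = expert_capacity(num_tokens=8, num_experts=4, top_k=2, capacity_factor=1.25)
kept, dropped = apply_capacity(indices, num_experts=4, capacity=capacity)
print(capacity)  # 5
print(dropped)   # [(5, 0), (6, 0), (7, 0)]
```

Here expert 0 is overloaded, so its last three assignments are dropped. A capacity limit like this bounds the work per expert, at the cost of some tokens losing one of their experts.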

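Given the growing interest in mixture of experts and in multimodality, exploring the intersection of the two should be very interesting. Happy hacking!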