<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="/scripts/pretty-feed-v3.xsl" type="text/xsl"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:h="http://www.w3.org/TR/html4/"><channel><title>Pengwee Wang&apos;s blog</title><description>Time waits for no one.</description><link>https://pengwee.wang</link><item><title>SimCLR</title><link>https://pengwee.wang/blog/simclr</link><guid isPermaLink="true">https://pengwee.wang/blog/simclr</guid><description>SimCLR是自监督视觉表征对比学习算法</description><pubDate>Thu, 18 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h3&gt;整体思想&lt;/h3&gt;
&lt;p&gt;SimCLR is a contrastive learning algorithm for self-supervised visual representation learning. Its core goal is to learn image representations that are strongly discriminative and generalize well (&lt;strong&gt;invariant to data augmentation and other nuisance variation&lt;/strong&gt;: ignoring superficial distortions such as cropping, color shifts, and blur, while capturing the core semantics that distinguish categories)&lt;del&gt;, so that images of the same class naturally cluster in feature space (close together, highly similar) while images of different classes separate (far apart, less similar)&lt;/del&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://zhuanlan.zhihu.com/p/197802321&quot;&gt;https://zhuanlan.zhihu.com/p/197802321&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In short, the idea of contrastive learning (CL) is that if two things are similar, we want their encodings to be similar as well. In practice, most current approaches reduce dimensionality and then compute a contrastive loss. The difficulty lies in designing the contrastive loss and the contrastive samples during this dimensionality reduction.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Experimental findings:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The composition of data augmentations&lt;/strong&gt; is crucial for defining effective predictive tasks.&lt;/p&gt;
&lt;p&gt;Introducing a &lt;strong&gt;learnable nonlinear transformation&lt;/strong&gt; between the representation and the contrastive loss substantially improves representation quality.&lt;/p&gt;
&lt;p&gt;Contrastive learning benefits more from &lt;strong&gt;larger batch sizes and more training steps&lt;/strong&gt; than supervised learning does.&lt;/p&gt;
&lt;h3&gt;Implementation Details&lt;/h3&gt;
&lt;p&gt;SimCLR architecture:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/image-20251218101807-yguwcsn.BMVbhiSi_2fVSsq.webp&quot; alt=&quot;image&quot;&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Stochastic data augmentation $\mathcal{T}$&lt;/li&gt;
&lt;li&gt;Visual representation encoder $f(\cdot)$&lt;/li&gt;
&lt;li&gt;Projection head $g(\cdot)$, containing a nonlinear activation&lt;/li&gt;
&lt;li&gt;Contrastive loss &lt;strong&gt;NT-Xent (normalized temperature-scaled cross-entropy)&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Cosine similarity $\text{sim}(\boldsymbol{u}, \boldsymbol{v}) = \boldsymbol{u}^T \boldsymbol{v}/(\|\boldsymbol{u}\| \|\boldsymbol{v}\|)$&lt;/li&gt;
&lt;li&gt;$\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}$&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Training algorithm:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/C05DDC0F-D42C-40A8-84DE-AF0BDB4DC234-20251218102900-otkliwl.DpJb2D-i_hQK6C.webp&quot; alt=&quot;{C05DDC0F-D42C-40A8-84DE-AF0BDB4DC234}&quot;&gt;&lt;/p&gt;
&lt;p&gt;For a batch of $N$ samples $\{x_k\}_{k=1}^{N}$, two augmentations $t \sim \mathcal{T}, t^{&apos;} \sim \mathcal{T}$ are sampled, producing $2N$ views, where $x_{2k-1}, x_{2k}$ are two different augmentations of the same image; passing them through $f(\cdot)$ and $g(\cdot)$ yields $z_{2k-1}, z_{2k}$.&lt;/p&gt;
&lt;p&gt;The loss is then $\mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N} \left[ \ell(2k-1, 2k) + \ell(2k, 2k-1) \right]$.&lt;/p&gt;
&lt;p&gt;In the experiments, the two augmentations of the same image within a batch are treated as a positive pair, and augmentations of all other images serve as negatives. The batch size is set very large (4096 to 8192); in my view, one purpose is to dilute the effect of different images of the same class ending up as negatives (since their low-dimensional projections should be similar).&lt;/p&gt;
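&lt;p&gt;The batching scheme above can be sketched as a minimal NT-Xent computation (an illustrative sketch, not the official SimCLR code; it assumes the two views of image $k$ sit at adjacent rows of the projection matrix):&lt;/p&gt;

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z, temperature=0.5):
    """NT-Xent over 2N projections z, where rows (0, 1), (2, 3), ... are view pairs."""
    z = F.normalize(z, dim=1)            # unit norm, so dot products are cosine similarities
    sim = z @ z.T / temperature          # (2N, 2N) similarity logits
    sim.fill_diagonal_(float("-inf"))    # drop the k == i term from the denominator
    # The positive for row i is its partner view; i ^ 1 swaps indices within each pair.
    targets = torch.arange(z.size(0)) ^ 1
    return F.cross_entropy(sim, targets)  # averages l(2k-1, 2k) and l(2k, 2k-1)
```

&lt;p&gt;With perfectly aligned pairs and orthogonal negatives the per-example loss reduces to log(1 + (2N-2)·exp(-1/temperature)), which shows why larger batches (more negatives) make the pretext task harder and more informative.&lt;/p&gt;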
&lt;h4&gt;The nonlinear projection head improves the quality of the representation before it&lt;/h4&gt;
&lt;p&gt;Experiments show that the projection matrix $W$ of the nonlinear projection head $g(\cdot)$ is low-rank, meaning that a small number of dimensions carry most of the useful information.&lt;/p&gt;
&lt;p&gt;The rank of a matrix reflects the number of &lt;strong&gt;independent, informative dimensions&lt;/strong&gt; it contains.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;High-rank matrix: most dimensions carry distinct information, with no obvious redundancy;&lt;/li&gt;
&lt;li&gt;Low-rank matrix: a few dimensions suffice to approximately reconstruct the core information; the remaining dimensions are either redundant or meaningless noise (contributing little to the task).&lt;/li&gt;
&lt;/ul&gt;
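&lt;p&gt;To make &quot;low rank&quot; concrete: a matrix assembled from a few outer products has only a few non-negligible singular values, so a handful of directions carries essentially all of its information (an illustrative sketch, not SimCLR&apos;s actual $W$):&lt;/p&gt;

```python
import torch

torch.manual_seed(0)
# A 128 x 128 matrix of rank 4: the sum of four outer products.
W = sum(torch.outer(torch.randn(128), torch.randn(128)) for _ in range(4))
s = torch.linalg.svdvals(W)                    # singular values, largest first
effective_rank = int((s > 1e-6 * s[0]).sum())  # count the non-negligible ones
print(effective_rank)                          # 4: only four directions matter
```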
&lt;p&gt;In other words, $g(\cdot)$ filters $h$ into a representation adapted to computing the contrastive loss, while at the same time constraining and guiding $h$ to learn useful features.&lt;/p&gt;
&lt;p&gt;Why not use $z$ directly for downstream tasks? Because $g(\cdot)$ serves only as a guide: it adapts the representation to the loss function, not to downstream tasks.&lt;/p&gt;</content:encoded><h:img src="/_astro/image-20251218101807-yguwcsn.BMVbhiSi.png"/><enclosure url="/_astro/image-20251218101807-yguwcsn.BMVbhiSi.png"/></item><item><title>Annotated Transformer</title><link>https://pengwee.wang/blog/annotated-transformer</link><guid isPermaLink="true">https://pengwee.wang/blog/annotated-transformer</guid><description>The Annotated Transformer: a detailed annotated implementation of Attention Is All You Need</description><pubDate>Fri, 05 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The original article is at &lt;a href=&quot;http://nlp.seas.harvard.edu/annotated-transformer/&quot;&gt;http://nlp.seas.harvard.edu/annotated-transformer/&lt;/a&gt;; it is reposted here only for easier access, and all rights remain with the original authors.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;v2022: Austin Huang, Suraj Subramanian, Jonathan Sum, Khalid Almubarak,
and Stella Biderman.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&quot;https://nlp.seas.harvard.edu/2018/04/03/attention.html&quot;&gt;Original&lt;/a&gt;:
&lt;a href=&quot;http://rush-nlp.com/&quot;&gt;Sasha Rush&lt;/a&gt;.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Transformer has been on a lot of
people&apos;s minds over the last five years.
This post presents an annotated version of the paper in the
form of a line-by-line implementation. It reorders and deletes
some sections from the original paper and adds comments
throughout. This document itself is a working notebook, and should
be a completely usable implementation.
Code is available
&lt;a href=&quot;https://github.com/harvardnlp/annotated-transformer/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;Prelims&lt;/h1&gt;
&lt;p&gt;Skip&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# !pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# # Uncomment for colab
# #
# !pip install -q torchdata==0.3.0 torchtext==0.12 spacy==3.2 altair GPUtil
# !python -m spacy download de_core_news_sm
# !python -m spacy download en_core_web_sm
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import os
from os.path import exists
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax, pad
import math
import copy
import time
from torch.optim.lr_scheduler import LambdaLR
import pandas as pd
import altair as alt
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader
from torchtext.vocab import build_vocab_from_iterator
import torchtext.datasets as datasets
import spacy
import GPUtil
import warnings
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


# Set to False to skip notebook execution (e.g. for debugging)
warnings.filterwarnings(&quot;ignore&quot;)
RUN_EXAMPLES = True
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Some convenience helper functions used throughout the notebook


def is_interactive_notebook():
    return __name__ == &quot;__main__&quot;


def show_example(fn, args=[]):
    if __name__ == &quot;__main__&quot; and RUN_EXAMPLES:
        return fn(*args)


def execute_example(fn, args=[]):
    if __name__ == &quot;__main__&quot; and RUN_EXAMPLES:
        fn(*args)


class DummyOptimizer(torch.optim.Optimizer):
    def __init__(self):
        self.param_groups = [{&quot;lr&quot;: 0}]
        None

    def step(self):
        None

    def zero_grad(self, set_to_none=False):
        None


class DummyScheduler:
    def step(self):
        None
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;My comments are blockquoted. The main text is all from the paper itself.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Background&lt;/h1&gt;
&lt;p&gt;The goal of reducing sequential computation also forms the
foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of
which use convolutional neural networks as basic building block,
computing hidden representations in parallel for all input and
output positions. In these models, the number of operations required
to relate signals from two arbitrary input or output positions grows
in the distance between positions, linearly for ConvS2S and
logarithmically for ByteNet. This makes it more difficult to learn
dependencies between distant positions. In the Transformer this is
reduced to a constant number of operations, albeit at the cost of
reduced effective resolution due to averaging attention-weighted
positions, an effect we counteract with Multi-Head Attention.&lt;/p&gt;
&lt;p&gt;Self-attention, sometimes called intra-attention is an attention
mechanism relating different positions of a single sequence in order
to compute a representation of the sequence. Self-attention has been
used successfully in a variety of tasks including reading
comprehension, abstractive summarization, textual entailment and
learning task-independent sentence representations. End-to-end
memory networks are based on a recurrent attention mechanism instead
of sequence-aligned recurrence and have been shown to perform well on
simple-language question answering and language modeling tasks.&lt;/p&gt;
&lt;p&gt;To the best of our knowledge, however, the Transformer is the first
transduction model relying entirely on self-attention to compute
representations of its input and output without using sequence
aligned RNNs or convolution.&lt;/p&gt;
&lt;h1&gt;Part 1: Model Architecture&lt;/h1&gt;
&lt;h1&gt;Model Architecture&lt;/h1&gt;
&lt;p&gt;Most competitive neural sequence transduction models have an
encoder-decoder structure
&lt;a href=&quot;https://arxiv.org/abs/1409.0473&quot;&gt;(cite)&lt;/a&gt;. Here, the encoder maps an
input sequence of symbol representations $(x_1, ..., x_n)$ to a
sequence of continuous representations $\mathbf{z} = (z_1, ...,
z_n)$. Given $\mathbf{z}$, the decoder then generates an output
sequence $(y_1,...,y_m)$ of symbols one element at a time. At each
step the model is auto-regressive
&lt;a href=&quot;https://arxiv.org/abs/1308.0850&quot;&gt;(cite)&lt;/a&gt;, consuming the previously
generated symbols as additional input when generating the next.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class EncoderDecoder(nn.Module):
    &quot;&quot;&quot;
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    &quot;&quot;&quot;

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        &quot;Take in and process masked src and target sequences.&quot;
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class Generator(nn.Module):
    &quot;Define standard linear + softmax generation step.&quot;

    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Transformer follows this overall architecture using stacked
self-attention and point-wise, fully connected layers for both the
encoder and decoder, shown in the left and right halves of Figure 1,
respectively.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/ModalNet-21.E_wcADzz_1H2Ujx.webp&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Encoder and Decoder Stacks&lt;/h2&gt;
&lt;h3&gt;Encoder&lt;/h3&gt;
&lt;p&gt;The encoder is composed of a stack of $N=6$ identical layers.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def clones(module, N):
    &quot;Produce N identical layers.&quot;
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class Encoder(nn.Module):
    &quot;Core encoder is a stack of N layers&quot;

    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        &quot;Pass the input (and mask) through each layer in turn.&quot;
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We employ a residual connection
&lt;a href=&quot;https://arxiv.org/abs/1512.03385&quot;&gt;(cite)&lt;/a&gt; around each of the two
sub-layers, followed by layer normalization
&lt;a href=&quot;https://arxiv.org/abs/1607.06450&quot;&gt;(cite)&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class LayerNorm(nn.Module):
    &quot;Construct a layernorm module (See citation for details).&quot;

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That is, the output of each sub-layer is $\mathrm{LayerNorm}(x +
\mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function
implemented by the sub-layer itself. We apply dropout
&lt;a href=&quot;http://jmlr.org/papers/v15/srivastava14a.html&quot;&gt;(cite)&lt;/a&gt; to the
output of each sub-layer, before it is added to the sub-layer input
and normalized.&lt;/p&gt;
&lt;p&gt;To facilitate these residual connections, all sub-layers in the
model, as well as the embedding layers, produce outputs of dimension
$d_{\text{model}}=512$.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class SublayerConnection(nn.Module):
    &quot;&quot;&quot;
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    &quot;&quot;&quot;

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        &quot;Apply residual connection to any sublayer with the same size.&quot;
        return x + self.dropout(sublayer(self.norm(x)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each layer has two sub-layers. The first is a multi-head
self-attention mechanism, and the second is a simple, position-wise
fully connected feed-forward network.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class EncoderLayer(nn.Module):
    &quot;Encoder is made up of self-attn and feed forward (defined below)&quot;

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        &quot;Follow Figure 1 (left) for connections.&quot;
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Decoder&lt;/h3&gt;
&lt;p&gt;The decoder is also composed of a stack of $N=6$ identical layers.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class Decoder(nn.Module):
    &quot;Generic N layer decoder with masking.&quot;

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In addition to the two sub-layers in each encoder layer, the decoder
inserts a third sub-layer, which performs multi-head attention over
the output of the encoder stack. Similar to the encoder, we employ
residual connections around each of the sub-layers, followed by
layer normalization.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class DecoderLayer(nn.Module):
    &quot;Decoder is made of self-attn, src-attn, and feed forward (defined below)&quot;

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        &quot;Follow Figure 1 (right) for connections.&quot;
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also modify the self-attention sub-layer in the decoder stack to
prevent positions from attending to subsequent positions. This
masking, combined with fact that the output embeddings are offset by
one position, ensures that the predictions for position $i$ can
depend only on the known outputs at positions less than $i$.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def subsequent_mask(size):
    &quot;Mask out subsequent positions.&quot;
    attn_shape = (1, size, size)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(
        torch.uint8
    )
    return subsequent_mask == 0
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Below the attention mask shows the position each tgt word (row) is
allowed to look at (column). Words are blocked for attending to
future words during training.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def example_mask():
    LS_data = pd.concat(
        [
            pd.DataFrame(
                {
                    &quot;Subsequent Mask&quot;: subsequent_mask(20)[0][x, y].flatten(),
                    &quot;Window&quot;: y,
                    &quot;Masking&quot;: x,
                }
            )
            for y in range(20)
            for x in range(20)
        ]
    )

    return (
        alt.Chart(LS_data)
        .mark_rect()
        .properties(height=250, width=250)
        .encode(
            alt.X(&quot;Window:O&quot;),
            alt.Y(&quot;Masking:O&quot;),
            alt.Color(&quot;Subsequent Mask:Q&quot;, scale=alt.Scale(scheme=&quot;viridis&quot;)),
        )
        .interactive()
    )


show_example(example_mask)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Attention&lt;/h3&gt;
&lt;p&gt;An attention function can be described as mapping a query and a set
of key-value pairs to an output, where the query, keys, values, and
output are all vectors. The output is computed as a weighted sum of
the values, where the weight assigned to each value is computed by a
compatibility function of the query with the corresponding key.&lt;/p&gt;
&lt;p&gt;We call our particular attention &quot;Scaled Dot-Product Attention&quot;.
The input consists of queries and keys of dimension $d_k$, and
values of dimension $d_v$. We compute the dot products of the query
with all keys, divide each by $\sqrt{d_k}$, and apply a softmax
function to obtain the weights on the values.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/ModalNet-19.C-nEKeVT_2sDvWj.webp&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;In practice, we compute the attention function on a set of queries
simultaneously, packed together into a matrix $Q$. The keys and
values are also packed together into matrices $K$ and $V$. We
compute the matrix of outputs as:&lt;/p&gt;
&lt;p&gt;$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V
$$&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def attention(query, key, value, mask=None, dropout=None):
    &quot;Compute &apos;Scaled Dot Product Attention&apos;&quot;
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The two most commonly used attention functions are additive
attention &lt;a href=&quot;https://arxiv.org/abs/1409.0473&quot;&gt;(cite)&lt;/a&gt;, and dot-product
(multiplicative) attention. Dot-product attention is identical to
our algorithm, except for the scaling factor of
$\frac{1}{\sqrt{d_k}}$. Additive attention computes the
compatibility function using a feed-forward network with a single
hidden layer. While the two are similar in theoretical complexity,
dot-product attention is much faster and more space-efficient in
practice, since it can be implemented using highly optimized matrix
multiplication code.&lt;/p&gt;
&lt;p&gt;While for small values of $d_k$ the two mechanisms perform
similarly, additive attention outperforms dot product attention
without scaling for larger values of $d_k$
&lt;a href=&quot;https://arxiv.org/abs/1703.03906&quot;&gt;(cite)&lt;/a&gt;. We suspect that for
large values of $d_k$, the dot products grow large in magnitude,
pushing the softmax function into regions where it has extremely
small gradients (To illustrate why the dot products get large,
assume that the components of $q$ and $k$ are independent random
variables with mean $0$ and variance $1$. Then their dot product,
$q \cdot k = \sum_{i=1}^{d_k} q_ik_i$, has mean $0$ and variance
$d_k$.). To counteract this effect, we scale the dot products by
$\frac{1}{\sqrt{d_k}}$.&lt;/p&gt;
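&lt;p&gt;The variance argument is easy to check numerically: with i.i.d. mean-0, variance-1 components, dot products of $d_k$-dimensional vectors have variance close to $d_k$, and scaling by $1/\sqrt{d_k}$ restores unit variance (an illustrative sketch, not part of the original notebook):&lt;/p&gt;

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(20_000, d_k)   # components i.i.d. with mean 0, variance 1
k = torch.randn(20_000, d_k)
dots = (q * k).sum(dim=-1)     # 20k sample dot products
print(dots.var().item())                  # close to d_k = 512
print((dots / d_k ** 0.5).var().item())   # close to 1 after scaling
```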
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/ModalNet-20.BioF6ALs_Z1NuBVo.webp&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Multi-head attention allows the model to jointly attend to
information from different representation subspaces at different
positions. With a single attention head, averaging inhibits this.&lt;/p&gt;
&lt;p&gt;$$
\mathrm{MultiHead}(Q, K, V) =
\mathrm{Concat}(\mathrm{head_1}, ..., \mathrm{head_h})W^O \\
\text{where}~\mathrm{head_i} = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)
$$&lt;/p&gt;
&lt;p&gt;Where the projections are parameter matrices $W^Q_i \in
\mathbb{R}^{d_{\text{model}} \times d_k}$, $W^K_i \in
\mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V_i \in
\mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in
\mathbb{R}^{hd_v \times d_{\text{model}}}$.&lt;/p&gt;
&lt;p&gt;In this work we employ $h=8$ parallel attention layers, or
heads. For each of these we use $d_k=d_v=d_{\text{model}}/h=64$. Due
to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full
dimensionality.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        &quot;Take in model size and number of heads.&quot;
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        &quot;Implements Figure 2&quot;
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model =&gt; h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )

        # 3) &quot;Concat&quot; using a view and apply a final linear.
        x = (
            x.transpose(1, 2)
            .contiguous()
            .view(nbatches, -1, self.h * self.d_k)
        )
        del query
        del key
        del value
        return self.linears[-1](x)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Applications of Attention in our Model&lt;/h3&gt;
&lt;p&gt;The Transformer uses multi-head attention in three different ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;In &quot;encoder-decoder attention&quot; layers, the queries come from the
previous decoder layer, and the memory keys and values come from the
output of the encoder. This allows every position in the decoder to
attend over all positions in the input sequence. This mimics the
typical encoder-decoder attention mechanisms in sequence-to-sequence
models such as &lt;a href=&quot;https://arxiv.org/abs/1609.08144&quot;&gt;(cite)&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The encoder contains self-attention layers. In a self-attention
layer all of the keys, values and queries come from the same place,
in this case, the output of the previous layer in the encoder. Each
position in the encoder can attend to all positions in the previous
layer of the encoder.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Similarly, self-attention layers in the decoder allow each
position in the decoder to attend to all positions in the decoder up
to and including that position. We need to prevent leftward
information flow in the decoder to preserve the auto-regressive
property. We implement this inside of scaled dot-product attention
by masking out (setting to $-\infty$) all values in the input of the
softmax which correspond to illegal connections.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Position-wise Feed-Forward Networks&lt;/h2&gt;
&lt;p&gt;In addition to attention sub-layers, each of the layers in our
encoder and decoder contains a fully connected feed-forward network,
which is applied to each position separately and identically. This
consists of two linear transformations with a ReLU activation in
between.&lt;/p&gt;
&lt;p&gt;$$\mathrm{FFN}(x)=\max(0, xW_1 + b_1) W_2 + b_2$$&lt;/p&gt;
&lt;p&gt;While the linear transformations are the same across different
positions, they use different parameters from layer to
layer. Another way of describing this is as two convolutions with
kernel size 1. The dimensionality of input and output is
$d_{\text{model}}=512$, and the inner-layer has dimensionality
$d_{ff}=2048$.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class PositionwiseFeedForward(nn.Module):
    &quot;Implements FFN equation.&quot;

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Embeddings and Softmax&lt;/h2&gt;
&lt;p&gt;Similarly to other sequence transduction models, we use learned
embeddings to convert the input tokens and output tokens to vectors
of dimension $d_{\text{model}}$. We also use the usual learned
linear transformation and softmax function to convert the decoder
output to predicted next-token probabilities. In our model, we
share the same weight matrix between the two embedding layers and
the pre-softmax linear transformation, similar to
&lt;a href=&quot;https://arxiv.org/abs/1608.05859&quot;&gt;(cite)&lt;/a&gt;. In the embedding layers,
we multiply those weights by $\sqrt{d_{\text{model}}}$.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Positional Encoding&lt;/h2&gt;
&lt;p&gt;Since our model contains no recurrence and no convolution, in order
for the model to make use of the order of the sequence, we must
inject some information about the relative or absolute position of
the tokens in the sequence. To this end, we add &quot;positional
encodings&quot; to the input embeddings at the bottoms of the encoder and
decoder stacks. The positional encodings have the same dimension
$d_{\text{model}}$ as the embeddings, so that the two can be summed.
There are many choices of positional encodings, learned and fixed
&lt;a href=&quot;https://arxiv.org/pdf/1705.03122.pdf&quot;&gt;(cite)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this work, we use sine and cosine functions of different frequencies:&lt;/p&gt;
&lt;p&gt;$$PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$$&lt;/p&gt;
&lt;p&gt;$$PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$$&lt;/p&gt;
&lt;p&gt;where $pos$ is the position and $i$ is the dimension. That is, each
dimension of the positional encoding corresponds to a sinusoid. The
wavelengths form a geometric progression from $2\pi$ to $10000 \cdot
2\pi$. We chose this function because we hypothesized it would
allow the model to easily learn to attend by relative positions,
since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a
linear function of $PE_{pos}$.&lt;/p&gt;
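&lt;p&gt;The linear-offset property can be checked directly: for each frequency, the (sin, cos) pair at position $pos+k$ is a fixed rotation of the pair at $pos$, independent of $pos$ (a sketch using the PE definition above; variable names are illustrative):&lt;/p&gt;

```python
import math
import torch

d_model, k = 8, 3
# Frequencies as in the equations above: 1 / 10000^(2i / d_model).
w = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))


def pe(pos):
    # (sin, cos) pair per frequency at a given position
    return torch.stack([torch.sin(pos * w), torch.cos(pos * w)], dim=-1)


# A rotation by k*w maps pe(pos) to pe(pos + k) for every pos: angle-addition formulas.
c, s = torch.cos(k * w), torch.sin(k * w)
for pos in [0.0, 5.0, 42.0]:
    p = pe(torch.tensor(pos))
    rotated = torch.stack([c * p[:, 0] + s * p[:, 1],
                           -s * p[:, 0] + c * p[:, 1]], dim=-1)
    assert torch.allclose(rotated, pe(torch.tensor(pos + k)), atol=1e-5)
```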
&lt;p&gt;In addition, we apply dropout to the sums of the embeddings and the
positional encodings in both the encoder and decoder stacks. For
the base model, we use a rate of $P_{drop}=0.1$.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class PositionalEncoding(nn.Module):
    &quot;Implement the PE function.&quot;

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer(&quot;pe&quot;, pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Below the positional encoding will add in a sine wave based on
position. The frequency and offset of the wave is different for
each dimension.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def example_positional():
    pe = PositionalEncoding(20, 0)
    y = pe.forward(torch.zeros(1, 100, 20))

    data = pd.concat(
        [
            pd.DataFrame(
                {
                    &quot;embedding&quot;: y[0, :, dim],
                    &quot;dimension&quot;: dim,
                    &quot;position&quot;: list(range(100)),
                }
            )
            for dim in [4, 5, 6, 7]
        ]
    )

    return (
        alt.Chart(data)
        .mark_line()
        .properties(width=800)
        .encode(x=&quot;position&quot;, y=&quot;embedding&quot;, color=&quot;dimension:N&quot;)
        .interactive()
    )


show_example(example_positional)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also experimented with using learned positional embeddings
&lt;a href=&quot;https://arxiv.org/pdf/1705.03122.pdf&quot;&gt;(cite)&lt;/a&gt; instead, and found
that the two versions produced nearly identical results. We chose
the sinusoidal version because it may allow the model to extrapolate
to sequence lengths longer than the ones encountered during
training.&lt;/p&gt;
&lt;h2&gt;Full Model&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Here we define a function from hyperparameters to a full model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def make_model(
    src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1
):
    &quot;Helper: Construct a model from hyperparameters.&quot;
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )

    # Initialize parameters with Glorot / fan_avg.
    # This initialization was important in the original code.
    for p in model.parameters():
        if p.dim() &gt; 1:
            nn.init.xavier_uniform_(p)
    return model
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Inference&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Here we take a forward step to generate a prediction from the
model. We try to use our transformer to memorize the input. As you
will see, the output is random because the model is not trained
yet. In the next section we build the training function and train
the model to memorize the numbers from 1 to 10.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def inference_test():
    test_model = make_model(11, 11, 2)
    test_model.eval()
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    src_mask = torch.ones(1, 1, 10)

    memory = test_model.encode(src, src_mask)
    ys = torch.zeros(1, 1).type_as(src)

    for i in range(9):
        out = test_model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = test_model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
            [ys, torch.empty(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )

    print(&quot;Example Untrained Model Prediction:&quot;, ys)


def run_tests():
    for _ in range(10):
        inference_test()


show_example(run_tests)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Example Untrained Model Prediction: tensor([[ 0, 10,  0, 10,  0,  0,  0,  0,  0, 10]])
Example Untrained Model Prediction: tensor([[ 0,  8,  1, 10,  0,  8,  1, 10,  0,  8]])


Example Untrained Model Prediction: tensor([[ 0,  9,  0, 10,  4,  5,  3,  2,  4,  3]])
Example Untrained Model Prediction: tensor([[0, 5, 5, 5, 5, 5, 5, 5, 5, 5]])


Example Untrained Model Prediction: tensor([[0, 2, 8, 3, 8, 5, 0, 4, 0, 4]])
Example Untrained Model Prediction: tensor([[ 0, 10,  3, 10,  2,  9,  0,  3, 10,  3]])


Example Untrained Model Prediction: tensor([[0, 3, 3, 3, 3, 3, 3, 3, 3, 3]])
Example Untrained Model Prediction: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])


Example Untrained Model Prediction: tensor([[0, 3, 2, 2, 2, 4, 0, 3, 1, 3]])
Example Untrained Model Prediction: tensor([[0, 6, 6, 6, 6, 6, 6, 6, 6, 6]])
&lt;/code&gt;&lt;/pre&gt;
&lt;h1&gt;Part 2: Model Training&lt;/h1&gt;
&lt;h1&gt;Training&lt;/h1&gt;
&lt;p&gt;This section describes the training regime for our models.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We stop for a quick interlude to introduce some of the tools
needed to train a standard encoder decoder model. First we define a
batch object that holds the src and target sentences for training,
as well as constructing the masks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Batches and Masking&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class Batch:
    &quot;&quot;&quot;Object for holding a batch of data with mask during training.&quot;&quot;&quot;

    def __init__(self, src, tgt=None, pad=2):  # 2 = &amp;#x3C;blank&gt;
        self.src = src
        self.src_mask = (src != pad).unsqueeze(-2)
        if tgt is not None:
            self.tgt = tgt[:, :-1]
            self.tgt_y = tgt[:, 1:]
            self.tgt_mask = self.make_std_mask(self.tgt, pad)
            self.ntokens = (self.tgt_y != pad).data.sum()

    @staticmethod
    def make_std_mask(tgt, pad):
        &quot;Create a mask to hide padding and future words.&quot;
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask &amp;#x26; subsequent_mask(tgt.size(-1)).type_as(
            tgt_mask.data
        )
        return tgt_mask
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Next we create a generic training and scoring function to keep
track of loss. We pass in a generic loss compute function that
also handles parameter updates.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Training Loop&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class TrainState:
    &quot;&quot;&quot;Track number of steps, examples, and tokens processed&quot;&quot;&quot;

    step: int = 0  # Steps in the current epoch
    accum_step: int = 0  # Number of gradient accumulation steps
    samples: int = 0  # total # of examples used
    tokens: int = 0  # total # of tokens processed
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def run_epoch(
    data_iter,
    model,
    loss_compute,
    optimizer,
    scheduler,
    mode=&quot;train&quot;,
    accum_iter=1,
    train_state=TrainState(),
):
    &quot;&quot;&quot;Train a single epoch&quot;&quot;&quot;
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    n_accum = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(
            batch.src, batch.tgt, batch.src_mask, batch.tgt_mask
        )
        loss, loss_node = loss_compute(out, batch.tgt_y, batch.ntokens)
        # loss_node = loss_node / accum_iter
        if mode == &quot;train&quot; or mode == &quot;train+log&quot;:
            loss_node.backward()
            train_state.step += 1
            train_state.samples += batch.src.shape[0]
            train_state.tokens += batch.ntokens
            if i % accum_iter == 0:
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)
                n_accum += 1
                train_state.accum_step += 1
            scheduler.step()

        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        if i % 40 == 1 and (mode == &quot;train&quot; or mode == &quot;train+log&quot;):
            lr = optimizer.param_groups[0][&quot;lr&quot;]
            elapsed = time.time() - start
            print(
                (
                    &quot;Epoch Step: %6d | Accumulation Step: %3d | Loss: %6.2f &quot;
                    + &quot;| Tokens / Sec: %7.1f | Learning Rate: %6.1e&quot;
                )
                % (i, n_accum, loss / batch.ntokens, tokens / elapsed, lr)
            )
            start = time.time()
            tokens = 0
        del loss
        del loss_node
    return total_loss / total_tokens, train_state
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Training Data and Batching&lt;/h2&gt;
&lt;p&gt;We trained on the standard WMT 2014 English-German dataset
consisting of about 4.5 million sentence pairs. Sentences were
encoded using byte-pair encoding, which has a shared source-target
vocabulary of about 37000 tokens. For English-French, we used the
significantly larger WMT 2014 English-French dataset consisting of
36M sentences and split tokens into a 32000 word-piece vocabulary.&lt;/p&gt;
&lt;p&gt;Sentence pairs were batched together by approximate sequence length.
Each training batch contained a set of sentence pairs containing
approximately 25000 source tokens and 25000 target tokens.&lt;/p&gt;
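&lt;p&gt;The token-count batching described above can be sketched as a simple length-bucketing loop. This is an illustrative sketch, not the actual WMT pipeline; the pair list and the token budget are toy stand-ins:&lt;/p&gt;

```python
def batch_by_tokens(pairs, max_tokens=25000):
    # Group (src, tgt) token lists so each batch holds roughly
    # max_tokens source tokens; sorting by length keeps padding minimal.
    pairs = sorted(pairs, key=lambda p: len(p[0]))
    batches, batch, n_tokens = [], [], 0
    for src, tgt in pairs:
        if batch and n_tokens + len(src) > max_tokens:
            batches.append(batch)
            batch, n_tokens = [], 0
        batch.append((src, tgt))
        n_tokens += len(src)
    if batch:
        batches.append(batch)
    return batches

# Toy sentences of lengths 3, 9, 4, 8, 5 with a 10-token budget
toy = [([0] * n, [0] * n) for n in [3, 9, 4, 8, 5]]
print(len(batch_by_tokens(toy, max_tokens=10)))  # 4 batches: [3,4], [5], [8], [9]
```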
&lt;h2&gt;Hardware and Schedule&lt;/h2&gt;
&lt;p&gt;We trained our models on one machine with 8 NVIDIA P100 GPUs. For
our base models using the hyperparameters described throughout the
paper, each training step took about 0.4 seconds. We trained the
base models for a total of 100,000 steps or 12 hours. For our big
models, step time was 1.0 seconds. The big models were trained for
300,000 steps (3.5 days).&lt;/p&gt;
&lt;h2&gt;Optimizer&lt;/h2&gt;
&lt;p&gt;We used the Adam optimizer &lt;a href=&quot;https://arxiv.org/abs/1412.6980&quot;&gt;(cite)&lt;/a&gt;
with $\beta_1=0.9$, $\beta_2=0.98$ and $\epsilon=10^{-9}$. We
varied the learning rate over the course of training, according to
the formula:&lt;/p&gt;
&lt;p&gt;$$
lrate = d_{\text{model}}^{-0.5} \cdot
\min(\text{step\_num}^{-0.5},
\text{step\_num} \cdot \text{warmup\_steps}^{-1.5})
$$&lt;/p&gt;
&lt;p&gt;This corresponds to increasing the learning rate linearly for the
first $\text{warmup\_steps}$ training steps, and decreasing it
thereafter proportionally to the inverse square root of the step
number. We used $\text{warmup\_steps}=4000$.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: this part is very important; the model needs to be trained
with this learning-rate setup.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Below we plot example learning-rate curves for different model
sizes and warmup settings.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def rate(step, model_size, factor, warmup):
    &quot;&quot;&quot;
    We default the step to 1 in the LambdaLR function
    to avoid raising zero to a negative power.
    &quot;&quot;&quot;
    if step == 0:
        step = 1
    return factor * (
        model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def example_learning_schedule():
    opts = [
        [512, 1, 4000],  # example 1
        [512, 1, 8000],  # example 2
        [256, 1, 4000],  # example 3
    ]

    dummy_model = torch.nn.Linear(1, 1)
    learning_rates = []

    # we have 3 examples in opts list.
    for idx, example in enumerate(opts):
        # run 20000 steps for each example
        optimizer = torch.optim.Adam(
            dummy_model.parameters(), lr=1, betas=(0.9, 0.98), eps=1e-9
        )
        lr_scheduler = LambdaLR(
            optimizer=optimizer, lr_lambda=lambda step: rate(step, *example)
        )
        tmp = []
        # take 20K dummy training steps, save the learning rate at each step
        for step in range(20000):
            tmp.append(optimizer.param_groups[0][&quot;lr&quot;])
            optimizer.step()
            lr_scheduler.step()
        learning_rates.append(tmp)

    learning_rates = torch.tensor(learning_rates)

    # Enable altair to handle more than 5000 rows
    alt.data_transformers.disable_max_rows()

    opts_data = pd.concat(
        [
            pd.DataFrame(
                {
                    &quot;Learning Rate&quot;: learning_rates[warmup_idx, :],
                    &quot;model_size:warmup&quot;: [&quot;512:4000&quot;, &quot;512:8000&quot;, &quot;256:4000&quot;][
                        warmup_idx
                    ],
                    &quot;step&quot;: range(20000),
                }
            )
            for warmup_idx in [0, 1, 2]
        ]
    )

    return (
        alt.Chart(opts_data)
        .mark_line()
        .properties(width=600)
        .encode(x=&quot;step&quot;, y=&quot;Learning Rate&quot;, color=&quot;model_size:warmup:N&quot;)
        .interactive()
    )


example_learning_schedule()
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Regularization&lt;/h2&gt;
&lt;h3&gt;Label Smoothing&lt;/h3&gt;
&lt;p&gt;During training, we employed label smoothing of value
$\epsilon_{ls}=0.1$ &lt;a href=&quot;https://arxiv.org/abs/1512.00567&quot;&gt;(cite)&lt;/a&gt;.
This hurts perplexity, as the model learns to be more unsure, but
improves accuracy and BLEU score.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We implement label smoothing using the KL div loss. Instead of
using a one-hot target distribution, we create a distribution that
has &lt;code&gt;confidence&lt;/code&gt; of the correct word and the rest of the
&lt;code&gt;smoothing&lt;/code&gt; mass distributed throughout the vocabulary.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class LabelSmoothing(nn.Module):
    &quot;Implement label smoothing.&quot;

    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(reduction=&quot;sum&quot;)
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() &gt; 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, true_dist.clone().detach())
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Here we can see an example of how the mass is distributed to the
words based on confidence.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Example of label smoothing.


def example_label_smoothing():
    crit = LabelSmoothing(5, 0, 0.4)
    predict = torch.FloatTensor(
        [
            [0, 0.2, 0.7, 0.1, 0],
            [0, 0.2, 0.7, 0.1, 0],
            [0, 0.2, 0.7, 0.1, 0],
            [0, 0.2, 0.7, 0.1, 0],
            [0, 0.2, 0.7, 0.1, 0],
        ]
    )
    crit(x=predict.log(), target=torch.LongTensor([2, 1, 0, 3, 3]))
    LS_data = pd.concat(
        [
            pd.DataFrame(
                {
                    &quot;target distribution&quot;: crit.true_dist[x, y].flatten(),
                    &quot;columns&quot;: y,
                    &quot;rows&quot;: x,
                }
            )
            for y in range(5)
            for x in range(5)
        ]
    )

    return (
        alt.Chart(LS_data)
        .mark_rect(color=&quot;Blue&quot;, opacity=1)
        .properties(height=200, width=200)
        .encode(
            alt.X(&quot;columns:O&quot;, title=None),
            alt.Y(&quot;rows:O&quot;, title=None),
            alt.Color(
                &quot;target distribution:Q&quot;, scale=alt.Scale(scheme=&quot;viridis&quot;)
            ),
        )
        .interactive()
    )


show_example(example_label_smoothing)
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Label smoothing actually starts to penalize the model if it gets
very confident about a given choice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;

def loss(x, crit):
    d = x + 3 * 1
    predict = torch.FloatTensor([[0, x / d, 1 / d, 1 / d, 1 / d]])
    return crit(predict.log(), torch.LongTensor([1])).data


def penalization_visualization():
    crit = LabelSmoothing(5, 0, 0.1)
    loss_data = pd.DataFrame(
        {
            &quot;Loss&quot;: [loss(x, crit) for x in range(1, 100)],
            &quot;Steps&quot;: list(range(99)),
        }
    ).astype(&quot;float&quot;)

    return (
        alt.Chart(loss_data)
        .mark_line()
        .properties(width=350)
        .encode(
            x=&quot;Steps&quot;,
            y=&quot;Loss&quot;,
        )
        .interactive()
    )


show_example(penalization_visualization)
&lt;/code&gt;&lt;/pre&gt;
&lt;h1&gt;A First Example&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;We can begin by trying out a simple copy-task. Given a random set
of input symbols from a small vocabulary, the goal is to generate
back those same symbols.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Synthetic Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def data_gen(V, batch_size, nbatches):
    &quot;Generate random data for a src-tgt copy task.&quot;
    for i in range(nbatches):
        data = torch.randint(1, V, size=(batch_size, 10))
        data[:, 0] = 1
        src = data.requires_grad_(False).clone().detach()
        tgt = data.requires_grad_(False).clone().detach()
        yield Batch(src, tgt, 0)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Loss Computation&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class SimpleLossCompute:
    &quot;A simple loss compute and train function.&quot;

    def __init__(self, generator, criterion):
        self.generator = generator
        self.criterion = criterion

    def __call__(self, x, y, norm):
        x = self.generator(x)
        sloss = (
            self.criterion(
                x.contiguous().view(-1, x.size(-1)), y.contiguous().view(-1)
            )
            / norm
        )
        return sloss.data * norm, sloss
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Greedy Decoding&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;This code predicts a translation using greedy decoding for simplicity.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def greedy_decode(model, src, src_mask, max_len, start_symbol):
    memory = model.encode(src, src_mask)
    ys = torch.zeros(1, 1).fill_(start_symbol).type_as(src.data)
    for i in range(max_len - 1):
        out = model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
            [ys, torch.zeros(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )
    return ys
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Train the simple copy task.


def example_simple_model():
    V = 11
    criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
    model = make_model(V, V, N=2)

    optimizer = torch.optim.Adam(
        model.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
    )
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, model_size=model.src_embed[0].d_model, factor=1.0, warmup=400
        ),
    )

    batch_size = 80
    for epoch in range(20):
        model.train()
        run_epoch(
            data_gen(V, batch_size, 20),
            model,
            SimpleLossCompute(model.generator, criterion),
            optimizer,
            lr_scheduler,
            mode=&quot;train&quot;,
        )
        model.eval()
        run_epoch(
            data_gen(V, batch_size, 5),
            model,
            SimpleLossCompute(model.generator, criterion),
            DummyOptimizer(),
            DummyScheduler(),
            mode=&quot;eval&quot;,
        )[0]

    model.eval()
    src = torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
    max_len = src.shape[1]
    src_mask = torch.ones(1, 1, max_len)
    print(greedy_decode(model, src, src_mask, max_len=max_len, start_symbol=0))


# execute_example(example_simple_model)
&lt;/code&gt;&lt;/pre&gt;
&lt;h1&gt;Part 3: A Real World Example&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;Now we consider a real-world example using the Multi30k
German-English Translation task. This task is much smaller than
the WMT task considered in the paper, but it illustrates the whole
system. We also show how to use multi-gpu processing to make it
really fast.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Data Loading&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;We will load the dataset using torchtext and spacy for
tokenization.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Load spacy tokenizer models, download them if they haven&apos;t been
# downloaded already


def load_tokenizers():

    try:
        spacy_de = spacy.load(&quot;de_core_news_sm&quot;)
    except IOError:
        os.system(&quot;python -m spacy download de_core_news_sm&quot;)
        spacy_de = spacy.load(&quot;de_core_news_sm&quot;)

    try:
        spacy_en = spacy.load(&quot;en_core_web_sm&quot;)
    except IOError:
        os.system(&quot;python -m spacy download en_core_web_sm&quot;)
        spacy_en = spacy.load(&quot;en_core_web_sm&quot;)

    return spacy_de, spacy_en
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def tokenize(text, tokenizer):
    return [tok.text for tok in tokenizer.tokenizer(text)]


def yield_tokens(data_iter, tokenizer, index):
    for from_to_tuple in data_iter:
        yield tokenizer(from_to_tuple[index])
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;

def build_vocabulary(spacy_de, spacy_en):
    def tokenize_de(text):
        return tokenize(text, spacy_de)

    def tokenize_en(text):
        return tokenize(text, spacy_en)

    print(&quot;Building German Vocabulary ...&quot;)
    train, val, test = datasets.Multi30k(language_pair=(&quot;de&quot;, &quot;en&quot;))
    vocab_src = build_vocab_from_iterator(
        yield_tokens(train + val + test, tokenize_de, index=0),
        min_freq=2,
        specials=[&quot;&amp;#x3C;s&gt;&quot;, &quot;&amp;#x3C;/s&gt;&quot;, &quot;&amp;#x3C;blank&gt;&quot;, &quot;&amp;#x3C;unk&gt;&quot;],
    )

    print(&quot;Building English Vocabulary ...&quot;)
    train, val, test = datasets.Multi30k(language_pair=(&quot;de&quot;, &quot;en&quot;))
    vocab_tgt = build_vocab_from_iterator(
        yield_tokens(train + val + test, tokenize_en, index=1),
        min_freq=2,
        specials=[&quot;&amp;#x3C;s&gt;&quot;, &quot;&amp;#x3C;/s&gt;&quot;, &quot;&amp;#x3C;blank&gt;&quot;, &quot;&amp;#x3C;unk&gt;&quot;],
    )

    vocab_src.set_default_index(vocab_src[&quot;&amp;#x3C;unk&gt;&quot;])
    vocab_tgt.set_default_index(vocab_tgt[&quot;&amp;#x3C;unk&gt;&quot;])

    return vocab_src, vocab_tgt


def load_vocab(spacy_de, spacy_en):
    if not exists(&quot;vocab.pt&quot;):
        vocab_src, vocab_tgt = build_vocabulary(spacy_de, spacy_en)
        torch.save((vocab_src, vocab_tgt), &quot;vocab.pt&quot;)
    else:
        vocab_src, vocab_tgt = torch.load(&quot;vocab.pt&quot;)
    print(&quot;Finished.\nVocabulary sizes:&quot;)
    print(len(vocab_src))
    print(len(vocab_tgt))
    return vocab_src, vocab_tgt


if is_interactive_notebook():
    # global variables used later in the script
    spacy_de, spacy_en = show_example(load_tokenizers)
    vocab_src, vocab_tgt = show_example(load_vocab, args=[spacy_de, spacy_en])
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Finished.
Vocabulary sizes:
59981
36745
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Batching matters a ton for speed. We want to have very evenly
divided batches, with absolutely minimal padding. To do this we
have to hack a bit around the default torchtext batching. This
code patches their default batching to make sure we search over
enough sentences to find tight batches.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Iterators&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def collate_batch(
    batch,
    src_pipeline,
    tgt_pipeline,
    src_vocab,
    tgt_vocab,
    device,
    max_padding=128,
    pad_id=2,
):
    bs_id = torch.tensor([0], device=device)  # &amp;#x3C;s&gt; token id
    eos_id = torch.tensor([1], device=device)  # &amp;#x3C;/s&gt; token id
    src_list, tgt_list = [], []
    for (_src, _tgt) in batch:
        processed_src = torch.cat(
            [
                bs_id,
                torch.tensor(
                    src_vocab(src_pipeline(_src)),
                    dtype=torch.int64,
                    device=device,
                ),
                eos_id,
            ],
            0,
        )
        processed_tgt = torch.cat(
            [
                bs_id,
                torch.tensor(
                    tgt_vocab(tgt_pipeline(_tgt)),
                    dtype=torch.int64,
                    device=device,
                ),
                eos_id,
            ],
            0,
        )
        src_list.append(
            # warning - if a sentence exceeds max_padding, the pad
            # amount is negative and the sequence is truncated
            pad(
                processed_src,
                (
                    0,
                    max_padding - len(processed_src),
                ),
                value=pad_id,
            )
        )
        tgt_list.append(
            pad(
                processed_tgt,
                (0, max_padding - len(processed_tgt)),
                value=pad_id,
            )
        )

    src = torch.stack(src_list)
    tgt = torch.stack(tgt_list)
    return (src, tgt)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def create_dataloaders(
    device,
    vocab_src,
    vocab_tgt,
    spacy_de,
    spacy_en,
    batch_size=12000,
    max_padding=128,
    is_distributed=True,
):
    # def create_dataloaders(batch_size=12000):
    def tokenize_de(text):
        return tokenize(text, spacy_de)

    def tokenize_en(text):
        return tokenize(text, spacy_en)

    def collate_fn(batch):
        return collate_batch(
            batch,
            tokenize_de,
            tokenize_en,
            vocab_src,
            vocab_tgt,
            device,
            max_padding=max_padding,
            pad_id=vocab_src.get_stoi()[&quot;&amp;#x3C;blank&gt;&quot;],
        )

    train_iter, valid_iter, test_iter = datasets.Multi30k(
        language_pair=(&quot;de&quot;, &quot;en&quot;)
    )

    train_iter_map = to_map_style_dataset(
        train_iter
    )  # DistributedSampler needs a dataset len()
    train_sampler = (
        DistributedSampler(train_iter_map) if is_distributed else None
    )
    valid_iter_map = to_map_style_dataset(valid_iter)
    valid_sampler = (
        DistributedSampler(valid_iter_map) if is_distributed else None
    )

    train_dataloader = DataLoader(
        train_iter_map,
        batch_size=batch_size,
        shuffle=(train_sampler is None),
        sampler=train_sampler,
        collate_fn=collate_fn,
    )
    valid_dataloader = DataLoader(
        valid_iter_map,
        batch_size=batch_size,
        shuffle=(valid_sampler is None),
        sampler=valid_sampler,
        collate_fn=collate_fn,
    )
    return train_dataloader, valid_dataloader
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Training the System&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def train_worker(
    gpu,
    ngpus_per_node,
    vocab_src,
    vocab_tgt,
    spacy_de,
    spacy_en,
    config,
    is_distributed=False,
):
    print(f&quot;Train worker process using GPU: {gpu} for training&quot;, flush=True)
    torch.cuda.set_device(gpu)

    pad_idx = vocab_tgt[&quot;&amp;#x3C;blank&gt;&quot;]
    d_model = 512
    model = make_model(len(vocab_src), len(vocab_tgt), N=6)
    model.cuda(gpu)
    module = model
    is_main_process = True
    if is_distributed:
        dist.init_process_group(
            &quot;nccl&quot;, init_method=&quot;env://&quot;, rank=gpu, world_size=ngpus_per_node
        )
        model = DDP(model, device_ids=[gpu])
        module = model.module
        is_main_process = gpu == 0

    criterion = LabelSmoothing(
        size=len(vocab_tgt), padding_idx=pad_idx, smoothing=0.1
    )
    criterion.cuda(gpu)

    train_dataloader, valid_dataloader = create_dataloaders(
        gpu,
        vocab_src,
        vocab_tgt,
        spacy_de,
        spacy_en,
        batch_size=config[&quot;batch_size&quot;] // ngpus_per_node,
        max_padding=config[&quot;max_padding&quot;],
        is_distributed=is_distributed,
    )

    optimizer = torch.optim.Adam(
        model.parameters(), lr=config[&quot;base_lr&quot;], betas=(0.9, 0.98), eps=1e-9
    )
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, d_model, factor=1, warmup=config[&quot;warmup&quot;]
        ),
    )
    train_state = TrainState()

    for epoch in range(config[&quot;num_epochs&quot;]):
        if is_distributed:
            train_dataloader.sampler.set_epoch(epoch)
            valid_dataloader.sampler.set_epoch(epoch)

        model.train()
        print(f&quot;[GPU{gpu}] Epoch {epoch} Training ====&quot;, flush=True)
        _, train_state = run_epoch(
            (Batch(b[0], b[1], pad_idx) for b in train_dataloader),
            model,
            SimpleLossCompute(module.generator, criterion),
            optimizer,
            lr_scheduler,
            mode=&quot;train+log&quot;,
            accum_iter=config[&quot;accum_iter&quot;],
            train_state=train_state,
        )

        GPUtil.showUtilization()
        if is_main_process:
            file_path = &quot;%s%.2d.pt&quot; % (config[&quot;file_prefix&quot;], epoch)
            torch.save(module.state_dict(), file_path)
        torch.cuda.empty_cache()

        print(f&quot;[GPU{gpu}] Epoch {epoch} Validation ====&quot;, flush=True)
        model.eval()
        sloss = run_epoch(
            (Batch(b[0], b[1], pad_idx) for b in valid_dataloader),
            model,
            SimpleLossCompute(module.generator, criterion),
            DummyOptimizer(),
            DummyScheduler(),
            mode=&quot;eval&quot;,
        )
        print(sloss)
        torch.cuda.empty_cache()

    if is_main_process:
        file_path = &quot;%sfinal.pt&quot; % config[&quot;file_prefix&quot;]
        torch.save(module.state_dict(), file_path)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def train_distributed_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config):
    from the_annotated_transformer import train_worker

    ngpus = torch.cuda.device_count()
    os.environ[&quot;MASTER_ADDR&quot;] = &quot;localhost&quot;
    os.environ[&quot;MASTER_PORT&quot;] = &quot;12356&quot;
    print(f&quot;Number of GPUs detected: {ngpus}&quot;)
    print(&quot;Spawning training processes ...&quot;)
    mp.spawn(
        train_worker,
        nprocs=ngpus,
        args=(ngpus, vocab_src, vocab_tgt, spacy_de, spacy_en, config, True),
    )


def train_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config):
    if config[&quot;distributed&quot;]:
        train_distributed_model(
            vocab_src, vocab_tgt, spacy_de, spacy_en, config
        )
    else:
        train_worker(
            0, 1, vocab_src, vocab_tgt, spacy_de, spacy_en, config, False
        )


def load_trained_model():
    config = {
        &quot;batch_size&quot;: 32,
        &quot;distributed&quot;: False,
        &quot;num_epochs&quot;: 8,
        &quot;accum_iter&quot;: 10,
        &quot;base_lr&quot;: 1.0,
        &quot;max_padding&quot;: 72,
        &quot;warmup&quot;: 3000,
        &quot;file_prefix&quot;: &quot;multi30k_model_&quot;,
    }
    model_path = &quot;multi30k_model_final.pt&quot;
    if not exists(model_path):
        train_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config)

    model = make_model(len(vocab_src), len(vocab_tgt), N=6)
    model.load_state_dict(torch.load(&quot;multi30k_model_final.pt&quot;))
    return model


if is_interactive_notebook():
    model = load_trained_model()
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Once trained we can decode the model to produce a set of
translations. Here we simply translate the first sentence in the
validation set. This dataset is pretty small so the translations
with greedy search are reasonably accurate.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Additional Components: BPE, Search, Averaging&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;So this mostly covers the transformer model itself. There are four
aspects that we didn&apos;t cover explicitly. We also have all these
additional features implemented in
&lt;a href=&quot;https://github.com/opennmt/opennmt-py&quot;&gt;OpenNMT-py&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;BPE/ Word-piece: We can use a library to first preprocess the
data into subword units. See Rico Sennrich&apos;s
&lt;a href=&quot;https://github.com/rsennrich/subword-nmt&quot;&gt;subword-nmt&lt;/a&gt;
implementation. These models will transform the training data to
look like this:&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;▁Die ▁Protokoll datei ▁kann ▁ heimlich ▁per ▁E - Mail ▁oder ▁FTP
▁an ▁einen ▁bestimmte n ▁Empfänger ▁gesendet ▁werden .&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;Shared Embeddings: When using BPE with shared vocabulary we can
share the same weight vectors between the source / target /
generator. See the &lt;a href=&quot;https://arxiv.org/abs/1608.05859&quot;&gt;(cite)&lt;/a&gt; for
details. To add this to the model simply do this:&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;if False:
    model.src_embed[0].lut.weight = model.tgt_embed[0].lut.weight
    model.generator.proj.weight = model.tgt_embed[0].lut.weight
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;Beam Search: This is a bit too complicated to cover here. See the
&lt;a href=&quot;https://github.com/OpenNMT/OpenNMT-py/&quot;&gt;OpenNMT-py&lt;/a&gt;
for a pytorch implementation.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;Model Averaging: The paper averages the last k checkpoints to
create an ensembling effect. We can do this after the fact if we
have a bunch of models:&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def average(model, models):
    &quot;Average models into model&quot;
    with torch.no_grad():
        for ps in zip(*[m.parameters() for m in [model] + models]):
            ps[0].copy_(torch.stack(ps[1:]).mean(dim=0))
&lt;/code&gt;&lt;/pre&gt;
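&lt;p&gt;As a sanity check, the averaging routine can be exercised on toy &lt;code&gt;nn.Linear&lt;/code&gt; checkpoints (the helper &lt;code&gt;make&lt;/code&gt; below is ours, not part of the notebook):&lt;/p&gt;

```python
import torch
import torch.nn as nn


def average(model, models):
    "Average the parameters of models into model, in place."
    with torch.no_grad():
        for ps in zip(*[m.parameters() for m in [model] + models]):
            ps[0].copy_(torch.stack(ps[1:]).mean(dim=0))


def make(val):
    "A toy checkpoint: a bias-free linear layer with constant weights."
    m = nn.Linear(2, 2, bias=False)
    nn.init.constant_(m.weight, val)
    return m


# Averaging two constant-weight checkpoints gives the midpoint.
target = make(0.0)
average(target, [make(1.0), make(3.0)])
print(target.weight.data[0, 0].item())  # 2.0
```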
&lt;h1&gt;Results&lt;/h1&gt;
&lt;p&gt;On the WMT 2014 English-to-German translation task, the big
transformer model (Transformer (big) in Table 2) outperforms the
best previously reported models (including ensembles) by more than
2.0 BLEU, establishing a new state-of-the-art BLEU score of
28.4. The configuration of this model is listed in the bottom line
of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base
model surpasses all previously published models and ensembles, at a
fraction of the training cost of any of the competitive models.&lt;/p&gt;
&lt;p&gt;On the WMT 2014 English-to-French translation task, our big model
achieves a BLEU score of 41.0, outperforming all of the previously
published single models, at less than 1/4 the training cost of the
previous state-of-the-art model. The Transformer (big) model trained
for English-to-French used dropout rate Pdrop = 0.1, instead of 0.3.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With the additional extensions in the last section, the OpenNMT-py
replication gets to 26.9 on EN-DE WMT. Here I have loaded in those
parameters to our reimplementation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Load data and model for output checks
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def check_outputs(
    valid_dataloader,
    model,
    vocab_src,
    vocab_tgt,
    n_examples=15,
    pad_idx=2,
    eos_string=&quot;&amp;#x3C;/s&gt;&quot;,
):
    results = [()] * n_examples
    for idx in range(n_examples):
        print(&quot;\nExample %d ========\n&quot; % idx)
        b = next(iter(valid_dataloader))
        rb = Batch(b[0], b[1], pad_idx)

        src_tokens = [
            vocab_src.get_itos()[x] for x in rb.src[0] if x != pad_idx
        ]
        tgt_tokens = [
            vocab_tgt.get_itos()[x] for x in rb.tgt[0] if x != pad_idx
        ]

        print(
            &quot;Source Text (Input)        : &quot;
            + &quot; &quot;.join(src_tokens).replace(&quot;\n&quot;, &quot;&quot;)
        )
        print(
            &quot;Target Text (Ground Truth) : &quot;
            + &quot; &quot;.join(tgt_tokens).replace(&quot;\n&quot;, &quot;&quot;)
        )
        model_out = greedy_decode(model, rb.src, rb.src_mask, 72, 0)[0]
        model_txt = (
            &quot; &quot;.join(
                [vocab_tgt.get_itos()[x] for x in model_out if x != pad_idx]
            ).split(eos_string, 1)[0]
            + eos_string
        )
        print(&quot;Model Output               : &quot; + model_txt.replace(&quot;\n&quot;, &quot;&quot;))
        results[idx] = (rb, src_tokens, tgt_tokens, model_out, model_txt)
    return results


def run_model_example(n_examples=5):
    global vocab_src, vocab_tgt, spacy_de, spacy_en

    print(&quot;Preparing Data ...&quot;)
    _, valid_dataloader = create_dataloaders(
        torch.device(&quot;cpu&quot;),
        vocab_src,
        vocab_tgt,
        spacy_de,
        spacy_en,
        batch_size=1,
        is_distributed=False,
    )

    print(&quot;Loading Trained Model ...&quot;)

    model = make_model(len(vocab_src), len(vocab_tgt), N=6)
    model.load_state_dict(
        torch.load(&quot;multi30k_model_final.pt&quot;, map_location=torch.device(&quot;cpu&quot;))
    )

    print(&quot;Checking Model Outputs:&quot;)
    example_data = check_outputs(
        valid_dataloader, model, vocab_src, vocab_tgt, n_examples=n_examples
    )
    return model, example_data


# execute_example(run_model_example)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Attention Visualization&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Even with a greedy decoder the translation looks pretty good. We
can further visualize it to see what is happening at each layer of
the attention&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def mtx2df(m, max_row, max_col, row_tokens, col_tokens):
    &quot;convert a dense matrix to a data frame with row and column indices&quot;
    return pd.DataFrame(
        [
            (
                r,
                c,
                float(m[r, c]),
                &quot;%.3d %s&quot;
                % (r, row_tokens[r] if len(row_tokens) &gt; r else &quot;&amp;#x3C;blank&gt;&quot;),
                &quot;%.3d %s&quot;
                % (c, col_tokens[c] if len(col_tokens) &gt; c else &quot;&amp;#x3C;blank&gt;&quot;),
            )
            for r in range(m.shape[0])
            for c in range(m.shape[1])
            if r &amp;#x3C; max_row and c &amp;#x3C; max_col
        ],
        # if float(m[r,c]) != 0 and r &amp;#x3C; max_row and c &amp;#x3C; max_col],
        columns=[&quot;row&quot;, &quot;column&quot;, &quot;value&quot;, &quot;row_token&quot;, &quot;col_token&quot;],
    )


def attn_map(attn, layer, head, row_tokens, col_tokens, max_dim=30):
    df = mtx2df(
        attn[0, head].data,
        max_dim,
        max_dim,
        row_tokens,
        col_tokens,
    )
    return (
        alt.Chart(data=df)
        .mark_rect()
        .encode(
            x=alt.X(&quot;col_token&quot;, axis=alt.Axis(title=&quot;&quot;)),
            y=alt.Y(&quot;row_token&quot;, axis=alt.Axis(title=&quot;&quot;)),
            color=&quot;value&quot;,
            tooltip=[&quot;row&quot;, &quot;column&quot;, &quot;value&quot;, &quot;row_token&quot;, &quot;col_token&quot;],
        )
        .properties(height=400, width=400)
        .interactive()
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def get_encoder(model, layer):
    return model.encoder.layers[layer].self_attn.attn


def get_decoder_self(model, layer):
    return model.decoder.layers[layer].self_attn.attn


def get_decoder_src(model, layer):
    return model.decoder.layers[layer].src_attn.attn


def visualize_layer(model, layer, getter_fn, ntokens, row_tokens, col_tokens):
    # ntokens = last_example[0].ntokens
    attn = getter_fn(model, layer)
    n_heads = attn.shape[1]
    charts = [
        attn_map(
            attn,
            0,
            h,
            row_tokens=row_tokens,
            col_tokens=col_tokens,
            max_dim=ntokens,
        )
        for h in range(n_heads)
    ]
    assert n_heads == 8
    return alt.vconcat(
        charts[0]
        # | charts[1]
        | charts[2]
        # | charts[3]
        | charts[4]
        # | charts[5]
        | charts[6]
        # | charts[7]
        # layer + 1 due to 0-indexing
    ).properties(title=&quot;Layer %d&quot; % (layer + 1))
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Encoder Self Attention&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def viz_encoder_self():
    model, example_data = run_model_example(n_examples=1)
    example = example_data[
        len(example_data) - 1
    ]  # batch object for the final example

    layer_viz = [
        visualize_layer(
            model, layer, get_encoder, len(example[1]), example[1], example[1]
        )
        for layer in range(6)
    ]
    return alt.hconcat(
        layer_viz[0]
        # &amp;#x26; layer_viz[1]
        &amp;#x26; layer_viz[2]
        # &amp;#x26; layer_viz[3]
        &amp;#x26; layer_viz[4]
        # &amp;#x26; layer_viz[5]
    )


show_example(viz_encoder_self)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Preparing Data ...


Loading Trained Model ...


Checking Model Outputs:

Example 0 ========



Source Text (Input)        : &amp;#x3C;s&gt; Mehrere Kinder heben die Hände , während sie auf einem bunten Teppich in einem Klassenzimmer sitzen . &amp;#x3C;/s&gt;
Target Text (Ground Truth) : &amp;#x3C;s&gt; Several children are raising their hands while sitting on a colorful rug in a classroom . &amp;#x3C;/s&gt;


Model Output               : &amp;#x3C;s&gt; A group of children are in their hands while sitting on a colorful carpet . &amp;#x3C;/s&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Decoder Self Attention&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def viz_decoder_self():
    model, example_data = run_model_example(n_examples=1)
    example = example_data[len(example_data) - 1]

    layer_viz = [
        visualize_layer(
            model,
            layer,
            get_decoder_self,
            len(example[1]),
            example[1],
            example[1],
        )
        for layer in range(6)
    ]
    return alt.hconcat(
        layer_viz[0]
        &amp;#x26; layer_viz[1]
        &amp;#x26; layer_viz[2]
        &amp;#x26; layer_viz[3]
        &amp;#x26; layer_viz[4]
        &amp;#x26; layer_viz[5]
    )


show_example(viz_decoder_self)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Preparing Data ...


Loading Trained Model ...


Checking Model Outputs:

Example 0 ========



Source Text (Input)        : &amp;#x3C;s&gt; Drei Menschen wandern auf einem stark verschneiten Weg . &amp;#x3C;/s&gt;
Target Text (Ground Truth) : &amp;#x3C;s&gt; A &amp;#x3C;unk&gt; of people are hiking throughout a heavily snowed path . &amp;#x3C;/s&gt;


Model Output               : &amp;#x3C;s&gt; Three people hiking on a busy &amp;#x3C;unk&gt; . &amp;#x3C;/s&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Decoder Src Attention&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def viz_decoder_src():
    model, example_data = run_model_example(n_examples=1)
    example = example_data[len(example_data) - 1]

    layer_viz = [
        visualize_layer(
            model,
            layer,
            get_decoder_src,
            max(len(example[1]), len(example[2])),
            example[1],
            example[2],
        )
        for layer in range(6)
    ]
    return alt.hconcat(
        layer_viz[0]
        &amp;#x26; layer_viz[1]
        &amp;#x26; layer_viz[2]
        &amp;#x26; layer_viz[3]
        &amp;#x26; layer_viz[4]
        &amp;#x26; layer_viz[5]
    )


show_example(viz_decoder_src)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Preparing Data ...


Loading Trained Model ...


Checking Model Outputs:

Example 0 ========



Source Text (Input)        : &amp;#x3C;s&gt; Baby sieht sich die Blätter am Zweig eines Baumes an . &amp;#x3C;/s&gt;
Target Text (Ground Truth) : &amp;#x3C;s&gt; Baby looking at the leaves on a branch of a tree . &amp;#x3C;/s&gt;


Model Output               : &amp;#x3C;s&gt; A baby is looking at the leaves at a tree . &amp;#x3C;/s&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Hopefully this code is useful for future research. Please reach
out if you have any issues.&lt;/p&gt;
&lt;p&gt;Cheers,
Sasha Rush, Austin Huang, Suraj Subramanian, Jonathan Sum, Khalid Almubarak,
Stella Biderman&lt;/p&gt;</content:encoded><h:img src="/_astro/aiayn.M7sRrIDc.png"/><enclosure url="/_astro/aiayn.M7sRrIDc.png"/></item><item><title>Flow Matching and Diffusion Models</title><link>https://pengwee.wang/blog/flow-matching-and-diffusion-models</link><guid isPermaLink="true">https://pengwee.wang/blog/flow-matching-and-diffusion-models</guid><description>Flow Matching and Diffusion Models 的介绍与对比</description><pubDate>Thu, 21 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Objects: data such as images, videos, and proteins can be viewed as vectors, i.e. $z \in \mathbb{R}^d$&lt;/p&gt;
&lt;p&gt;Generation: sampling from the data distribution, $z \sim p_{\text{data}}$&lt;/p&gt;
&lt;p&gt;Dataset: a finite set of samples from the data distribution, $z_1, \ldots, z_N \sim p_{\text{data}}$&lt;/p&gt;
&lt;p&gt;Conditional generation: sampling from a conditional distribution, $z \sim p_{\text{data}}(\cdot \mid y)$&lt;/p&gt;
&lt;p&gt;Goal: train a generative model that turns samples from a simple initial distribution $p_{\text{init}}$ into samples from the data distribution $p_{\text{data}}$&lt;/p&gt;
&lt;h2&gt;Flow and Diffusion Models&lt;/h2&gt;
&lt;p&gt;Simulating ordinary differential equations (ODEs) or stochastic differential equations (SDEs) can transport the initial distribution to the data distribution; the two cases correspond to flow models and diffusion models respectively&lt;/p&gt;
&lt;h3&gt;Flow Models&lt;/h3&gt;
&lt;p&gt;A flow model is described by an ODE:&lt;/p&gt;
&lt;p&gt;$$
X_0 \sim p_{\text{init}} \quad \triangleright \text{random initialization} \\
\frac{d}{dt}X_t=u_t^\theta(X_t) \quad \triangleright \text{ODE} \\
\text{Goal: } X_1 \sim p_{\text{data}} \Leftrightarrow \psi_{1}^{\theta}(X_0) \sim p_{\text{data}}
$$&lt;/p&gt;
&lt;p&gt;Here the vector field $u_t^\theta: \mathbb{R}^d\times[0,1] \rightarrow \mathbb{R}^d$ is a neural network with parameters $\theta$. $\psi^\theta_t$ denotes the flow induced by $u_t^\theta$, i.e. the collection of trajectories that solve the ODE&lt;/p&gt;
&lt;p&gt;Sampling from a flow model amounts to simulating the ODE with the Euler method to compute the flow&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/image-20250820180506862.B1sdxzj-_1L19zg.webp&quot; alt=&quot;image-20250820180506862&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Diffusion Models&lt;/h3&gt;
&lt;p&gt;A diffusion model is described by an SDE, written in increment form rather than with a time derivative because of the randomness:&lt;/p&gt;
&lt;p&gt;$$
dX_t = u_t^\theta(X_t)dt +\sigma_t dW_t \quad \triangleright \text{SDE} \\
X_0 \sim p_{\text{init}} \quad \triangleright \text{random initialization} \\
\text{Goal: } X_1 \sim p_{\text{data}}
$$&lt;/p&gt;
&lt;p&gt;where $\sigma_t \geq 0$ is the diffusion coefficient and $W_t$ is a Brownian motion&lt;/p&gt;
&lt;p&gt;Diffusion models thus extend flow models: taking $\sigma_t = 0$ recovers a flow model&lt;/p&gt;
&lt;p&gt;Likewise, the following (Euler–Maruyama) scheme samples from a diffusion model&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/image-20250820184041093.B27Fy-5l_2ghwU7.webp&quot; alt=&quot;image-20250820184041093&quot;&gt;&lt;/p&gt;
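&lt;p&gt;The Euler–Maruyama scheme adds a scaled Gaussian increment at each step. A minimal sketch (again with a toy constant drift; names are ours):&lt;/p&gt;

```python
import numpy as np


def euler_maruyama(drift, sigma, x0, n_steps=500, seed=0):
    "Simulate dX_t = u_t(X_t) dt + sigma_t dW_t on [0, 1]."
    rng = np.random.default_rng(seed)
    x, h = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        dw = np.sqrt(h) * rng.standard_normal(x.shape)  # Brownian increment
        x = x + h * drift(t, x) + sigma(t) * dw
    return x


# With sigma_t = 0 this reduces to the deterministic Euler scheme: X_1 = X_0 + 1.
x1 = euler_maruyama(lambda t, x: np.ones_like(x), lambda t: 0.0, np.zeros(3))
print(x1)
```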
&lt;h2&gt;Training Target and Training Loss&lt;/h2&gt;
&lt;p&gt;For flow models and diffusion models,&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
X_0 \sim p_{\text{init}},\quad dX_t &amp;#x26;= u_t^\theta(X_t) dt &amp;#x26; \text{(Flow model)} \\
X_0 \sim p_{\text{init}},\quad dX_t &amp;#x26;= u_t^\theta(X_t) dt + \sigma_t dW_t &amp;#x26; \text{(Diffusion model)}
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;Training minimizes a loss of the form&lt;/p&gt;
&lt;p&gt;$$
\mathcal{L}(\theta) = \left\| u_t^\theta(x) - \underbrace{u_t^{\text{target}}(x)}_{\text{training target}} \right\|^2
$$&lt;/p&gt;
&lt;p&gt;where $u_t^\theta$ is the network and $u_t^{\text{target}}(x)$ is a target vector field whose simulation transports the initial distribution to the data distribution. To compute $\mathcal{L}(\theta)$, directly or indirectly, we must first construct $u_t^{\text{target}}(x)$.&lt;/p&gt;
&lt;h3&gt;Probability Path&lt;/h3&gt;
&lt;p&gt;A probability path is a gradual interpolation from the initial distribution to the data distribution. We distinguish the conditional probability path $p_t(\cdot \mid z)$ from the marginal probability path $p_t(\cdot)$, where:&lt;/p&gt;
&lt;p&gt;$$
p_0(\cdot \mid z) = p_{\text{init}}, \quad p_1(\cdot \mid z) = \delta_z \quad \text{for all } z \in \mathbb{R}^d
$$&lt;/p&gt;
&lt;p&gt;The marginal path $p_t(\cdot)$ is obtained via&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
&amp;#x26;z \sim p_{\text{data}},\ x \sim p_t(\cdot \mid z) \implies x \sim p_t &amp;#x26;\triangleright \text{sampling from marginal path} \\
&amp;#x26;p_t(x) = \int p_t(x \mid z) p_{\text{data}}(z)\,dz &amp;#x26;\triangleright \text{density of marginal path} \\
&amp;#x26;p_0 = p_{\text{init}} \quad \text{and} \quad p_1 = p_{\text{data}} &amp;#x26;\triangleright \text{noise-data interpolation}
\end{align*}
$$&lt;/p&gt;
&lt;h3&gt;Training Target for Flow Model&lt;/h3&gt;
&lt;p&gt;For $z \sim p_{\text{data}}$ with $z \in \mathbb{R}^d$, let $u_t^{\text{target}}(\cdot \mid z)$ denote the conditional vector field generating the conditional probability path $p_t(\cdot \mid z)$, i.e.&lt;/p&gt;
&lt;p&gt;$$
X_0 \sim p_{\text{init}},\quad \frac{\mathrm{d}}{\mathrm{d}t}X_t = u_t^{\text{target}}(X_t|z) \quad \Rightarrow \quad X_t \sim p_t(\cdot|z) \quad (0 \leq t \leq 1)
$$&lt;/p&gt;
&lt;p&gt;Then the marginal vector field $u_t^{\text{target}}(x)$ can be defined as&lt;/p&gt;
&lt;p&gt;$$
u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z) \frac{p_t(x|z)p_{\text{data}}(z)}{p_t(x)} \,\mathrm{d}z
$$&lt;/p&gt;
&lt;p&gt;and it satisfies:&lt;/p&gt;
&lt;p&gt;$$
X_0 \sim p_{\text{init}},\quad \frac{\mathrm{d}}{\mathrm{d}t}X_t = u_t^{\text{target}}(X_t) \quad \Rightarrow \quad X_t \sim p_t \quad (0 \leq t \leq 1)
$$&lt;/p&gt;
&lt;p&gt;in particular $X_1 \sim p_{\text{data}}$.&lt;/p&gt;
&lt;p&gt;This can be proved with the &lt;strong&gt;continuity equation&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Continuity Equation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For a vector field $u_t^{\text{target}}$ with $X_0 \sim p_{\text{init}}$, we have $X_t \sim p_t$ for all $0 \leq t \leq 1$ if and only if&lt;/p&gt;
&lt;p&gt;$$
\partial_t p_t(x) = -\mathrm{div}(p_t u_t^{\text{target}})(x) \quad \text{for all } x \in \mathbb{R}^d, 0 \leq t \leq 1
$$&lt;/p&gt;
&lt;p&gt;where $\partial_t p_t(x) = \frac{\mathrm{d}}{\mathrm{d}t} p_t(x)$ and $\mathrm{div}(v_t)(x) = \sum_{i=1}^d \frac{\partial}{\partial x_i} (v_t)_i(x)$&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Training Target for Diffusion Model&lt;/h3&gt;
&lt;p&gt;Similarly, for diffusion models one can build on $u_t^{\text{target}}$ as follows so that $X_t \sim p_t$ holds for $0 \leq t \leq 1$:&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
&amp;#x26;X_0 \sim p_{\text{init}}, \quad \mathrm{d}X_t = \left[ u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2} \nabla \log p_t(X_t) \right] \mathrm{d}t + \sigma_t \mathrm{d}W_t \\
&amp;#x26;\Rightarrow X_t \sim p_t \quad (0 \leq t \leq 1)
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;and this remains true when $p_t(x)$ and $u_t^{\text{target}}(x)$ are replaced by $p_t(x\mid z)$ and $u_t^{\text{target}}(x \mid z)$&lt;/p&gt;
&lt;p&gt;Here $\nabla \log p_t(x)$ is called the marginal score function and $\nabla \log p_t(x \mid z)$ the conditional score function; they satisfy&lt;/p&gt;
&lt;p&gt;$$
\nabla \log p_t(x) = \frac{\nabla p_t(x)}{p_t(x)} = \frac{\nabla \int p_t(x|z) p_{\text{data}}(z) \,\mathrm{d}z}{p_t(x)} = \frac{\int \nabla p_t(x|z) p_{\text{data}}(z) \,\mathrm{d}z}{p_t(x)} = \int \nabla \log p_t(x|z) \frac{p_t(x|z) p_{\text{data}}(z)}{p_t(x)} \,\mathrm{d}z
$$&lt;/p&gt;
&lt;p&gt;This can be proved with the Fokker-Planck equation&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fokker-Planck Equation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For the SDE given by $X_0 \sim p_{\text{init}}, \quad \mathrm{d}X_t = u_t(X_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t$, we have $X_t \sim p_t$ if and only if&lt;/p&gt;
&lt;p&gt;$$
\partial_t p_t(x) = -\mathrm{div}(p_t u_t)(x) + \frac{\sigma_t^2}{2} \Delta p_t(x) \quad \text{for all } x \in \mathbb{R}^d, 0 \leq t \leq 1
$$&lt;/p&gt;
&lt;p&gt;where $\Delta w_t(x) = \sum_{i=1}^d \frac{\partial^2}{\partial x_i^2} w_t(x) = \mathrm{div}(\nabla w_t)(x)$&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Remark&lt;/strong&gt; Langevin dynamics&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;When $p_t=p$ is static, i.e. the probability path does not change over time, the SDE becomes&lt;/p&gt;
&lt;p&gt;$$
\mathrm{d}X_t = \frac{\sigma_t^2}{2} \nabla \log p(X_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t
$$&lt;/p&gt;
&lt;p&gt;and $X_0 \sim p \quad \Rightarrow \quad X_t \sim p \quad (t \geq 0)$; these are the Langevin dynamics&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Gaussian probability path&lt;/h3&gt;
&lt;p&gt;Let the noise schedules $\alpha_t, \beta_t$ be monotone, continuously differentiable functions with $\alpha_0=\beta_1=0$ and $\alpha_1=\beta_0=1$, and define the Gaussian conditional probability path as&lt;/p&gt;
&lt;p&gt;$$
p_t(\cdot|z) = \mathcal{N}(\alpha_t z, \beta_t^2 I_d)
$$&lt;/p&gt;
&lt;p&gt;It satisfies $p_0(\cdot|z) = \mathcal{N}(\alpha_0 z, \beta_0^2 I_d) = \mathcal{N}(0, I_d), \quad \text{and} \quad p_1(\cdot|z) = \mathcal{N}(\alpha_1 z, \beta_1^2 I_d) = \delta_z$&lt;/p&gt;
&lt;p&gt;Sampling from the corresponding marginal path is then straightforward:&lt;/p&gt;
&lt;p&gt;$$
z \sim p_{\text{data}},\ \epsilon \sim p_{\text{init}} = \mathcal{N}(0, I_d) \Rightarrow x = \alpha_t z + \beta_t \epsilon \sim p_t
$$&lt;/p&gt;
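&lt;p&gt;This sampling recipe is one line of code. A sketch in NumPy; the schedule below ($\alpha_t = t$, $\beta_t = 1 - t$) is just one valid choice satisfying $\alpha_0=\beta_1=0$, $\alpha_1=\beta_0=1$:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_gaussian_path(z, t, alpha, beta):
    "Draw x = alpha_t * z + beta_t * eps with eps ~ N(0, I), so x ~ p_t(.|z)."
    eps = rng.standard_normal(z.shape)
    return alpha(t) * z + beta(t) * eps


z = np.full(4, 5.0)  # a fixed "data" point
x0 = sample_gaussian_path(z, 0.0, lambda t: t, lambda t: 1.0 - t)  # pure noise
x1 = sample_gaussian_path(z, 1.0, lambda t: t, lambda t: 1.0 - t)  # exactly z
print(x1)  # [5. 5. 5. 5.]
```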
&lt;p&gt;The conditional Gaussian vector field for this path can be computed in closed form:&lt;/p&gt;
&lt;p&gt;$$
u_t^{\text{target}}(x|z) = \left( \dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t} \alpha_t \right) z + \frac{\dot{\beta}_t}{\beta_t} x
$$&lt;/p&gt;
&lt;p&gt;where $\dot{\alpha}_t = \partial_t \alpha_t$ and $\dot{\beta}_t = \partial_t \beta_t$&lt;/p&gt;
&lt;p&gt;Likewise, the conditional score function is&lt;/p&gt;
&lt;p&gt;$$
\nabla \log p_t(x|z) = -\frac{x - \alpha_t z}{\beta_t^2}
$$&lt;/p&gt;
&lt;h3&gt;Flow Matching&lt;/h3&gt;
&lt;p&gt;For flow models, define the flow matching loss as&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\mathcal{L}_{\text{FM}}(\theta) &amp;#x26;= \mathbb{E}_{t \sim \text{Unif}, x \sim p_t}[\|u_t^\theta(x) - u_t^{\text{target}}(x)\|^2] \\
&amp;#x26;= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot|z)}[\|u_t^\theta(x) - u_t^{\text{target}}(x)\|^2]
\end{align*}
$$&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;$z \sim p_{\text{data}},\ x \sim p_t(\cdot \mid z) \implies x \sim p_t$&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;and define the conditional flow matching loss as&lt;/p&gt;
&lt;p&gt;$$
\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot|z)}[\|u_t^\theta(x) - u_t^{\text{target}}(x|z)\|^2]
$$&lt;/p&gt;
&lt;p&gt;where $u_t^{\text{target}}(x|z)$ can be constructed by hand (e.g. from a Gaussian probability path)&lt;/p&gt;
&lt;p&gt;One can show that&lt;/p&gt;
&lt;p&gt;$$
\mathcal{L}_{\text{FM}}(\theta) = \mathcal{L}_{\text{CFM}}(\theta) + C
$$&lt;/p&gt;
&lt;p&gt;and hence&lt;/p&gt;
&lt;p&gt;$$
\nabla_\theta \mathcal{L}_{\text{FM}}(\theta) = \nabla_\theta \mathcal{L}_{\text{CFM}}(\theta)
$$&lt;/p&gt;
&lt;p&gt;Therefore optimizing $\mathcal{L}_{\text{CFM}}$ also optimizes $\mathcal{L}_{\text{FM}}$, and $\mathcal{L}_{\text{CFM}}$ only requires constructing a conditional probability path. This yields a training algorithm for flow models; the whole procedure is called flow matching&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Flow Matching for Gaussian Conditional Probability Paths&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For the Gaussian probability path, we have&lt;/p&gt;
&lt;p&gt;$$
\epsilon \sim \mathcal{N}(0, I_d) \quad \Rightarrow \quad x_t = \alpha_t z + \beta_t \epsilon \sim \mathcal{N}(\alpha_t z, \beta_t^2 I_d) = p_t(\cdot|z)
$$&lt;/p&gt;
&lt;p&gt;$$
u_t^{\mathrm{target}}(x|z)=\left(\dot{\alpha}_t-\frac{\dot{\beta}_t}{\beta_t}\alpha_t\right)z+\frac{\dot{\beta}_t}{\beta_t}x
$$&lt;/p&gt;
&lt;p&gt;$$
\begin{gathered}
\mathcal{L}_{\mathrm{CFM}}(\theta)=\mathbb{E}_{t\sim\mathrm{Unif},z\sim p_{\mathrm{data}},x\sim\mathcal{N}(\alpha_{t}z,\beta_{t}^{2}I_{d})}\left[\left\|u_{t}^{\theta}(x)-\left(\dot{\alpha}_{t}-\frac{\dot{\beta}_{t}}{\beta_{t}}\alpha_{t}\right)z-\frac{\dot{\beta}_{t}}{\beta_{t}}x\right\|^{2}\right] \\
\overset{(i)}{=}\mathbb{E}_{t\sim\mathrm{Unif},z\sim p_{\mathrm{data}},\epsilon\sim\mathcal{N}(0,I_{d})}\left[\left\|u_{t}^{\theta}(\alpha_{t}z+\beta_{t}\epsilon)-(\dot{\alpha}_{t}z+\dot{\beta}_{t}\epsilon)\right\|^{2}\right]
\end{gathered}
$$&lt;/p&gt;
&lt;p&gt;In particular, for $\alpha_t=t$ and $\beta_t=1-t$,&lt;/p&gt;
&lt;p&gt;$$
p_{t}(x|z)=\mathcal{N}(tz,(1-t)^{2}I_{d})
$$&lt;/p&gt;
&lt;p&gt;$$
\mathcal{L}_{\mathrm{CFM}}(\theta)=\mathbb{E}_{t\sim\mathrm{Unif},z\sim p_{\mathrm{data}},\epsilon\sim\mathcal{N}(0,I_{d})}\left[\left\|u_{t}^{\theta}(tz+(1-t)\epsilon)-(z-\epsilon)\right\|^{2}\right]
$$&lt;/p&gt;
&lt;p&gt;This is called the (Gaussian) &lt;strong&gt;CondOT probability path&lt;/strong&gt;; the training procedure is shown below&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/image-20250821093129333.HkL89N9B_1mxqLy.webp&quot; alt=&quot;image-20250821093129333&quot;&gt;&lt;/p&gt;
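&lt;p&gt;A minimal PyTorch sketch of one CondOT flow-matching training step; the tiny MLP and its input convention $u_t^\theta(x) = \mathrm{net}([x, t])$ are our own illustrative choices, not prescribed by the algorithm:&lt;/p&gt;

```python
import torch
import torch.nn as nn

# u_theta(x, t): a tiny MLP taking [x, t] as input (2-D toy data).
net = nn.Sequential(nn.Linear(3, 32), nn.SiLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)


def cfm_step(z):
    "One CondOT conditional flow matching step on a batch z ~ p_data."
    t = torch.rand(z.shape[0], 1)                  # t ~ Unif[0, 1]
    eps = torch.randn_like(z)                      # eps ~ N(0, I)
    x_t = t * z + (1.0 - t) * eps                  # x ~ N(t z, (1-t)^2 I)
    target = z - eps                               # u_t^target(x|z) = z - eps
    pred = net(torch.cat([x_t, t], dim=1))
    loss = ((pred - target) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


z = torch.randn(64, 2) + 4.0  # toy "data" batch
loss = cfm_step(z)
print(loss >= 0.0)  # True
```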
&lt;h3&gt;Score Matching&lt;/h3&gt;
&lt;p&gt;For diffusion models, $u_t^{\text{target}}$ is hard to obtain directly, so a &lt;strong&gt;score network&lt;/strong&gt; $s_t^\theta : \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d$ is used to fit the score function. Analogously, there are a score matching loss and a conditional score matching loss:&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\mathcal{L}_{\text{SM}}(\theta) &amp;#x26;= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot|z)}[\|s_t^\theta(x) - \nabla \log p_t(x)\|^2] \quad \triangleright \text{ score matching loss} \\
\mathcal{L}_{\text{CSM}}(\theta) &amp;#x26;= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot|z)}[\|s_t^\theta(x) - \nabla \log p_t(x|z)\|^2] \quad \triangleright \text{ conditional score matching loss}
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;As before, $\nabla \log p_t(x)$ is unknown but $\nabla \log p_t(x \mid z)$ can be constructed by hand, and&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
&amp;#x26;\mathcal{L}_{\text{SM}}(\theta) = \mathcal{L}_{\text{CSM}}(\theta) + C \\
&amp;#x26;\implies \nabla_\theta \mathcal{L}_{\text{SM}}(\theta) = \nabla_\theta \mathcal{L}_{\text{CSM}}(\theta)
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;so it suffices to optimize $\mathcal{L}_{\text{CSM}}(\theta)$. Sampling then proceeds as&lt;/p&gt;
&lt;p&gt;$$
X_0 \sim p_{\text{init}}, \quad \mathrm{d}X_t = \left[ u_t^\theta(X_t) + \frac{\sigma_t^2}{2} s_t^\theta(X_t) \right] \mathrm{d}t + \sigma_t \mathrm{d}W_t \implies X_1 \sim p_{\text{data}}
$$&lt;/p&gt;
&lt;p&gt;In theory any $\sigma_t \geq 0$ yields valid samples, but because of the error from inexact SDE simulation and the training error, in practice there is an optimal $\sigma_t$. Note also that simulating this SDE requires learning $u_t^\theta$ as well; in practice a single network with two outputs can model both $u_t^\theta$ and $s_t^\theta$, and for certain probability paths the two can be converted into each other.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Denoising Diffusion Models: Score Matching for Gaussian Probability Paths&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For Gaussian probability paths, we have&lt;/p&gt;
&lt;p&gt;$$
\nabla \log p_t(x|z) = -\frac{x - \alpha_t z}{\beta_t^2}
$$&lt;/p&gt;
&lt;p&gt;so&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\mathcal{L}_{\text{CSM}}(\theta) &amp;#x26;= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot|z)}\left[\left\|s_t^\theta(x) + \frac{x - \alpha_t z}{\beta_t^2}\right\|^2\right] \\
&amp;#x26;= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, \epsilon \sim \mathcal{N}(0, I_d)}\left[\left\|s_t^\theta(\alpha_t z + \beta_t \epsilon) + \frac{\epsilon}{\beta_t}\right\|^2\right] \\
&amp;#x26;= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, \epsilon \sim \mathcal{N}(0, I_d)}\left[\frac{1}{\beta_t^2} \left\|\beta_t s_t^\theta(\alpha_t z + \beta_t \epsilon) + \epsilon\right\|^2\right]
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;Since the factor $\frac{1}{\beta^2_t}$ makes the loss blow up as $\beta_t \to 0$, it is usually dropped, and $s^\theta_t$ is reparameterized as a noise-prediction network $\epsilon_t^\theta$, which yields the DDPM loss&lt;/p&gt;
&lt;p&gt;$$
-\beta_t s_t^\theta(x) = \epsilon_t^\theta(x) \quad \Rightarrow \quad \mathcal{L}_{\text{DDPM}}(\theta) = \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, \epsilon \sim \mathcal{N}(0, I_d)}\left[\left\|\epsilon_t^\theta(\alpha_t z + \beta_t \epsilon) - \epsilon\right\|^2\right]
$$&lt;/p&gt;
&lt;p&gt;The training procedure is shown below&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/image-20250821102850655.tzbC0Gd3_25BmOi.webp&quot; alt=&quot;image-20250821102850655&quot;&gt;&lt;/p&gt;
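&lt;p&gt;The DDPM objective is equally compact in code; here $\epsilon_t^\theta$ is a toy MLP over $[x, t]$ and the schedule $\alpha_t = t$, $\beta_t = 1 - t$ is chosen only for illustration:&lt;/p&gt;

```python
import torch
import torch.nn as nn

# eps_theta(x, t): noise-prediction network over [x, t] (2-D toy data).
eps_net = nn.Sequential(nn.Linear(3, 32), nn.SiLU(), nn.Linear(32, 2))


def ddpm_loss(z, alpha, beta):
    "L_DDPM = E || eps_theta(alpha_t z + beta_t eps, t) - eps ||^2."
    t = torch.rand(z.shape[0], 1)
    eps = torch.randn_like(z)
    x_t = alpha(t) * z + beta(t) * eps
    pred = eps_net(torch.cat([x_t, t], dim=1))
    return ((pred - eps) ** 2).sum(dim=1).mean()


z = torch.randn(64, 2)
loss = ddpm_loss(z, alpha=lambda t: t, beta=lambda t: 1.0 - t)
print(loss.item() >= 0.0)  # True
```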
&lt;p&gt;Moreover, for Gaussian probability paths the vector field and the score can be converted into each other:&lt;/p&gt;
&lt;p&gt;$$
u_t^{\text{target}}(x|z) = \left( \beta_t^2 \frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t \beta_t \right) \nabla \log p_t(x|z) + \frac{\dot{\alpha}_t}{\alpha_t} x \\
u_t^{\text{target}}(x) = \left( \beta_t^2 \frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t \beta_t \right) \nabla \log p_t(x) + \frac{\dot{\alpha}_t}{\alpha_t} x
$$&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;proof&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;$$
u_t^{\text{target}}(x|z) = \left( \dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t} \alpha_t \right) z + \frac{\dot{\beta}_t}{\beta_t} x
\stackrel{(i)}{=} \left( \beta_t^2 \frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t \beta_t \right) \left( \frac{\alpha_t z - x}{\beta_t^2} \right) + \frac{\dot{\alpha}_t}{\alpha_t} x
= \left( \beta_t^2 \frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t \beta_t \right) \nabla \log p_t(x|z) + \frac{\dot{\alpha}_t}{\alpha_t} x
$$&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
u_t^{\text{target}}(x) &amp;#x26;= \int u_t^{\text{target}}(x|z) \frac{p_t(x|z) p_{\text{data}}(z)}{p_t(x)} \,\mathrm{d}z \\
&amp;#x26;= \int \left[ \left( \beta_t^2 \frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t \beta_t \right) \nabla \log p_t(x|z) + \frac{\dot{\alpha}_t}{\alpha_t} x \right] \frac{p_t(x|z) p_{\text{data}}(z)}{p_t(x)} \,\mathrm{d}z \\
&amp;#x26;\stackrel{(i)}{=} \left( \beta_t^2 \frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t \beta_t \right) \nabla \log p_t(x) + \frac{\dot{\alpha}_t}{\alpha_t} x
\end{align*}
$$&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The networks $u_t^\theta$ and $s^\theta_t$ can likewise be converted into each other:&lt;/p&gt;
&lt;p&gt;$$
u_t^\theta(x) = \left( \beta_t^2 \frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t \beta_t \right) s_t^\theta(x) + \frac{\dot{\alpha}_t}{\alpha_t} x
$$&lt;/p&gt;
&lt;p&gt;$$
s_t^\theta(x) = \frac{\alpha_t u_t^\theta(x) - \dot{\alpha}_t x}{\beta_t^2 \dot{\alpha}_t - \alpha_t \dot{\beta}_t \beta_t}
$$&lt;/p&gt;
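&lt;p&gt;A quick numerical check of the score-to-field conversion for the schedule $\alpha_t = t$, $\beta_t = 1 - t$ (so $\dot{\alpha}_t = 1$, $\dot{\beta}_t = -1$): plugging the conditional score into the conversion recovers $u_t^{\text{target}}(x|z)$ computed directly.&lt;/p&gt;

```python
import numpy as np


def field_from_score(score, x, t):
    "u = (beta^2 * adot/alpha - bdot*beta) * score + (adot/alpha) * x."
    a, b, adot, bdot = t, 1.0 - t, 1.0, -1.0
    return (b**2 * adot / a - bdot * b) * score + (adot / a) * x


x, z, t = np.array([0.3]), np.array([2.0]), 0.4
cond_score = -(x - t * z) / (1.0 - t) ** 2          # grad log p_t(x|z)
direct = (1.0 + t / (1.0 - t)) * z - x / (1.0 - t)  # u_t^target(x|z) directly
print(np.allclose(field_from_score(cond_score, x, t), direct))  # True
```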
&lt;p&gt;Hence for Gaussian probability paths it suffices to train either $u_t^\theta$ or $s^\theta_t$, &lt;strong&gt;and either flow matching or score matching may be used&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Finally, given a trained $s_t^\theta$, sampling from the SDE proceeds as&lt;/p&gt;
&lt;p&gt;$$
X_0 \sim p_{\text{init}}, \quad \mathrm{d}X_t = \left[ \left( \beta_t^2 \frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t \beta_t + \frac{\sigma_t^2}{2} \right) s_t^\theta(X_t) + \frac{\dot{\alpha}_t}{\alpha_t} X_t \right] \mathrm{d}t + \sigma_t \mathrm{d}W_t \\
\implies X_1 \sim p_{\text{data}}
$$&lt;/p&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;In summary, flow matching is simpler and more extensible than score matching: it can transport an arbitrary initial distribution $p_{\text{init}}$ to an arbitrary data distribution $p_{\text{data}}$, whereas denoising diffusion models only apply to Gaussian initial distributions and Gaussian probability paths. Flow matching is closely related to stochastic interpolants.&lt;/p&gt;
&lt;h2&gt;Conditional (Guided) Generation&lt;/h2&gt;
&lt;p&gt;Generating an object &lt;strong&gt;conditioned on&lt;/strong&gt; &lt;strong&gt;some additional information&lt;/strong&gt; is called conditional generation; to avoid confusion with conditional vector fields it is usually called guided generation&lt;/p&gt;
&lt;p&gt;Formally, for $y \in \mathcal{Y}$ we want to sample from $p_{\text{data}}(x \mid y)$, so the model contains a guided vector field $u_t^{\theta}(\cdot \mid y)$ and has the following structure&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\text{Neural network: } &amp;#x26; u_t^\theta : \mathbb{R}^d \times \mathcal{Y} \times [0, 1] \to \mathbb{R}^d, \quad (x, y, t) \mapsto u_t^\theta(x|y) \\
\text{Fixed: } &amp;#x26; \sigma_t : [0, 1] \to [0, \infty), \quad t \mapsto \sigma_t
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;For a given $y \in \mathbb{R}^{d_y}$, sampling can be described as&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\text{Initialization:} \quad &amp;#x26; X_0 \sim p_{\text{init}} \quad &amp;#x26;\triangleright \text{ Initialize with simple distribution} \\
\text{Simulation:} \quad &amp;#x26; \mathrm{d}X_t = u_t^\theta(X_t|y)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t \quad &amp;#x26;\triangleright \text{ Simulate SDE from } t=0 \text{ to } t=1 \\
\text{Goal:} \quad &amp;#x26; X_1 \sim p_{\text{data}}(\cdot|y) \quad &amp;#x26;\triangleright  X_1 \text{ to be distributed like } p_{\text{data}}(\cdot|y)
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;With $\sigma_t=0$ this is a guided flow model&lt;/p&gt;
&lt;h3&gt;Guided Models&lt;/h3&gt;
&lt;p&gt;The training loss for guided flow models (the optimization objective, i.e. the guided conditional flow matching objective) follows directly:&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\mathcal{L}_{\text{CFM}}^{\text{guided}}(\theta) &amp;#x26;= \mathbb{E}_{(z,y) \sim p_{\text{data}}(z,y),\, t \sim \text{Unif}(0,1),\, x \sim p_t(\cdot|z)} \left[ \left\| u_t^\theta(x|y) - u_t^{\text{target}}(x|z) \right\|^2 \right]
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;Similarly, for guided diffusion models there is a guided conditional score matching objective:&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\mathcal{L}_{\text{CSM}}^{\text{guided}}(\theta) &amp;#x26;= \mathbb{E}_{\square} \left[ \left\| s_t^\theta(x|y) - \nabla \log p_t(x|z) \right\|^2 \right] \\
\square &amp;#x26;= (z, y) \sim p_{\text{data}}(z, y),\ t \sim \text{Unif}(0,1),\ x \sim p_t(\cdot|z)
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;虽然理论上上述目标已足以生成标签$y$对应的样本，但实际上生成结果往往并不十分fit $y$，而且无法控制生成内容对label的fit程度。一种解决方法是人为加强$y$的作用，目前较为先进的技术是Classifier-Free Guidance。&lt;/p&gt;
&lt;h4&gt;Classifier-Free Guidance&lt;/h4&gt;
&lt;p&gt;对于Flow Models，以Gaussian probability paths为例&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
u_t^{\text{target}}(x|y) = a_t x + b_t \nabla \log p_t(x|y)
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;其中&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
(a_t, b_t) = \left( \frac{\dot{\alpha}_t}{\alpha_t}, \frac{\dot{\alpha}_t \beta_t^2 - \dot{\beta}_t \beta_t \alpha_t}{\alpha_t} \right)
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;又&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\nabla \log p_t(x|y) = \nabla \log \left( \frac{p_t(x) p_t(y|x)}{p_t(y)} \right) = \nabla \log p_t(x) + \nabla \log p_t(y|x)
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;则&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
u_t^{\text{target}}(x|y) = a_t x + b_t (\nabla \log p_t(x) + \nabla \log p_t(y|x)) = u_t^{\text{target}}(x) + b_t \nabla \log p_t(y|x)
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;可以看出，guided vector field是由unguided vector field和guided score相加得到，一种很自然的想法是对guided score进行加权，得到&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\tilde{u}_t(x|y) = u_t^{\text{target}}(x) + wb_t \nabla \log p_t(y|x)
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;其中guided score $\nabla \log p_t(y|x)$可以看作一个作用在带噪样本上的类别分类器的得分（即classifier guidance），早期工作确实采用这种方法实现；进一步对guided score展开分析可得：&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\tilde{u}_t(x|y) &amp;#x26;= u_t^{\text{target}}(x) + w b_t \nabla \log p_t(y|x) \\
&amp;#x26;= u_t^{\text{target}}(x) + w b_t (\nabla \log p_t(x|y) - \nabla \log p_t(x)) \\
&amp;#x26;= u_t^{\text{target}}(x) - w (a_t x + b_t \nabla \log p_t(x)) + w (a_t x + b_t \nabla \log p_t(x|y)) \\
&amp;#x26;= (1 - w) u_t^{\text{target}}(x) + w u_t^{\text{target}}(x|y).
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;即$\tilde{u}_t(x|y)$由unguided vector field和guided vector field加权得到。进一步地，引入空标签$y = \varnothing$，训练时以人为设定的概率（超参数）$\eta$将$y$替换为$\varnothing$，从而用$u_t^{\text{target}}(x|\varnothing)$代替$u_t^{\text{target}}(x)$，具体可公式化描述为&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\mathcal{L}_{\text{CFM}}^{\text{CFG}}(\theta) &amp;#x26;= \mathbb{E}_{\square} \left[ \left\| u_t^\theta(x|y) - u_t^{\text{target}}(x|z) \right\|^2 \right] \\
\square &amp;#x26;= (z, y) \sim p_{\text{data}}(z, y),\ t \sim \text{Unif}(0,1),\ x \sim p_t(\cdot|z),\ \text{replace } y = \varnothing \text{ with prob. } \eta
\end{align*}
$$&lt;/p&gt;
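&lt;p&gt;训练中“以概率$\eta$将$y$替换为$\varnothing$”这一步可以用几行代码示意。下面是一个numpy小示例，其中&lt;code&gt;NULL&lt;/code&gt;标记与丢弃概率均为演示用的假设约定：&lt;/p&gt;

```python
import numpy as np

NULL = -1  # 演示用约定：用一个特殊标记表示空标签（真实标签取非负整数）

def drop_labels(y, eta, rng):
    # 以概率 eta 将每个标签替换为 NULL，对应目标中的 replace y with prob. eta
    y = np.array(y, copy=True)
    mask = np.less(rng.random(y.shape), eta)
    y[mask] = NULL
    return y

rng = np.random.default_rng(0)
y = np.arange(10000) % 10
y_dropped = drop_labels(y, eta=0.1, rng=rng)
frac = np.mean(y_dropped == NULL)  # 应接近 0.1
```

&lt;p&gt;这样同一个网络既学到条件向量场，也学到$y=\varnothing$对应的无条件向量场。&lt;/p&gt;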
&lt;p&gt;对于Diffusion Models，$\tilde{s}_t(x|y)$同样可改写如下&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\tilde{s}_t(x|y) &amp;#x26;= \nabla \log p_t(x) + w \nabla \log p_t(y|x) \\
&amp;#x26;= \nabla \log p_t(x) + w (\nabla \log p_t(x|y) - \nabla \log p_t(x)) \\
&amp;#x26;= (1 - w) \nabla \log p_t(x) + w \nabla \log p_t(x|y) \\
&amp;#x26;= (1 - w) \nabla \log p_t(x|\varnothing) + w \nabla \log p_t(x|y)
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;training objective如下&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\mathcal{L}_{\text{CSM}}^{\text{CFG}}(\theta) &amp;#x26;= \mathbb{E}_{\square} \left[ \left\| s_t^\theta(x|(1 - \xi)y + \xi \varnothing) - \nabla \log p_t(x|z) \right\|^2 \right] \\
\square &amp;#x26;= (z, y) \sim p_{\text{data}}(z, y),\ t \sim \text{Unif}(0,1),\ x \sim p_t(\cdot|z),\ \text{replace } y = \varnothing \text{ with prob. } \eta
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;训练时，我们通常也可同时优化${s}_t^\theta(x|y)$和${u}_t^\theta(x|y)$，对应的，有&lt;/p&gt;
&lt;p&gt;$$
\begin{align*}
\tilde{s}_t^\theta(x|y) &amp;#x26;= (1 - w) s_t^\theta(x|\varnothing) + w s_t^\theta(x|y), \\
\tilde{u}_t^\theta(x|y) &amp;#x26;= (1 - w) u_t^\theta(x|\varnothing) + w u_t^\theta(x|y).
\end{align*}
$$&lt;/p&gt;
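&lt;p&gt;推理时的加权组合本身非常简单，可用numpy示意如下（&lt;code&gt;u_uncond&lt;/code&gt;、&lt;code&gt;u_cond&lt;/code&gt;代表模型在$\varnothing$与$y$条件下的输出，这里用随机向量代替真实网络输出）：&lt;/p&gt;

```python
import numpy as np

def cfg_combine(u_uncond, u_cond, w):
    # (1 - w) * 无条件输出 + w * 条件输出；w 大于 1 时为外推，进一步放大条件信息
    return (1.0 - w) * u_uncond + w * u_cond

rng = np.random.default_rng(0)
u_uncond = rng.standard_normal(4)
u_cond = rng.standard_normal(4)

u_w1 = cfg_combine(u_uncond, u_cond, w=1.0)  # 退化为纯条件向量场
u_w0 = cfg_combine(u_uncond, u_cond, w=0.0)  # 退化为无条件向量场
```

&lt;p&gt;同一函数对$\tilde{s}_t^\theta$与$\tilde{u}_t^\theta$均适用。&lt;/p&gt;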
&lt;p&gt;采样时，有&lt;/p&gt;
&lt;p&gt;$$
\mathrm{d}X_t = \left[ \tilde{u}_t^\theta(X_t|y) + \frac{\sigma_t^2}{2} \tilde{s}_t^\theta(X_t|y) \right] \mathrm{d}t + \sigma_t \mathrm{d}W_t
$$&lt;/p&gt;
&lt;h2&gt;Network architectures&lt;/h2&gt;
&lt;p&gt;网络模型的设计随建模数据的复杂程度各有差别，但都需满足&lt;/p&gt;
&lt;p&gt;$$
\text{Neural network: }  u_t^\theta : \mathbb{R}^d \times \mathcal{Y} \times [0, 1] \to \mathbb{R}^d, \quad (x, y, t) \mapsto u_t^\theta(x|y)
$$&lt;/p&gt;
&lt;h3&gt;U-Nets&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/image-20250821130050114.riacPsTa_Z1Buxq2.webp&quot; alt=&quot;image-20250821130050114&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Diffusion Transformers&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/8bb17b0d052f439725e08dfee592beb0.B7gJovgr_Z6c6Vr.webp&quot; alt=&quot;img&quot;&gt;&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Peter Holderrieth and Ezra Erives. An Introduction to Flow Matching and Diffusion Models [EB/OL]. https://arxiv.org/abs/2506.02070, 2025.&lt;/p&gt;</content:encoded><h:img src="/_astro/8bb17b0d052f439725e08dfee592beb0.B7gJovgr.png"/><enclosure url="/_astro/8bb17b0d052f439725e08dfee592beb0.B7gJovgr.png"/></item><item><title>智慧树试卷导出脚本</title><link>https://pengwee.wang/blog/zhi-hui-shu-cha-juan-dao-chu</link><guid isPermaLink="true">https://pengwee.wang/blog/zhi-hui-shu-cha-juan-dao-chu</guid><description>智慧树试卷导出Tampermonkey脚本</description><pubDate>Fri, 13 Jun 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;安装&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;先安装&lt;a href=&quot;https://microsoftedge.microsoft.com/addons/detail/%E7%AF%A1%E6%94%B9%E7%8C%B4/iikmkjmpaadaobahmlepeloendndfphd&quot;&gt;Tampermonkey&lt;/a&gt;（已安装请忽略）&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;点击&lt;a href=&quot;/tools/zhihuishu_exam_export/zhihuishu_exam_export_v1.user.js&quot;&gt;这里&lt;/a&gt; 配合 &lt;a href=&quot;/tools/zhihuishu_exam_export&quot;&gt;渲染工具&lt;/a&gt;，或者直接安装&lt;a href=&quot;/tools/zhihuishu_exam_export/zhihuishu_exam_export_v2.user.js&quot;&gt;这个&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>What Is 3D Rendering? Complete Guide to 3D Visualization</title><link>https://pengwee.wang/blog/3d-rendering</link><guid isPermaLink="true">https://pengwee.wang/blog/3d-rendering</guid><description>3D imagery has the power to bring cinematic visions to life and help accurately plan tomorrow’s cityscapes. Here, 3D expert Ricardo Ortiz explains how it works.</description><pubDate>Sun, 09 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;3D rendering is all around us. From huge action movies to car commercials to previews of upcoming buildings or product designs, 3D visualization has become so widespread and realistic that you probably don’t even know it’s there.&lt;/p&gt;
&lt;p&gt;In this introductory piece, Chaos’ Ricardo Ortiz explains the basics of 3D rendering, from the computational methods that create imagery to the artistic techniques that create great computer-generated (CG) content and its various uses.&lt;/p&gt;
&lt;h2&gt;What is 3D Rendering?&lt;/h2&gt;
&lt;p&gt;Put simply, 3D rendering is the process of using a computer to generate a 2D image from a digital three-dimensional scene.&lt;/p&gt;
&lt;p&gt;To generate an image, specific methodologies and special software and hardware are used. Therefore, we need to understand that 3D rendering is a process—the one that builds the image.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/nikola-arsov-still-life-interior-design-vray-3ds-max-05-930px.DoY3_oVo_alYGQ.webp&quot; alt=&quot;alt text&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Types of 3D rendering&lt;/h2&gt;
&lt;p&gt;We can create different types of rendered images; they can be realistic or non-realistic.&lt;/p&gt;
&lt;p&gt;A realistic image could be an architectural interior that looks like a photograph, a product-design image such as a piece of furniture, or an automotive rendering of a car. On the other hand, we can create a non-realistic image such as an outline-type diagram or a cartoon-style image with a traditional 2D look. Technically, we can visualize anything we can imagine.&lt;/p&gt;
&lt;h2&gt;How is 3D rendering used?&lt;/h2&gt;
&lt;p&gt;3D rendering is an essential technique for many industries including architecture, product design, advertising, video games and visual effects for film, TV and animation.&lt;/p&gt;
&lt;p&gt;In design and architecture, renders allow creative people to communicate their ideas in a clear and transparent way. A render gives them the chance to evaluate their proposals, experiment with materials, conduct studies and contextualize their designs in the real world before they are built or manufactured.&lt;/p&gt;
&lt;p&gt;For the media and entertainment industries, 3D rendering is fundamental to the creation of sequences and animations that tell stories, whether we’re watching an animated movie, a period drama, or an action sequence with explosions, ships from the future, exotic locales, or extraterrestrial creatures.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/thanos-dd-single-image-004a.DUX4VGf-_1A3bTN.webp&quot; alt=&quot;alt text&quot;&gt;&lt;/p&gt;
&lt;p&gt;Over the past few years, the evolution of computer graphics in these industries has replaced traditional techniques. For example, special effects are being replaced by visual effects, which means stunt people no longer risk their lives in car crashes.&lt;/p&gt;
&lt;p&gt;In advertising, I would dare to say that 90% of automotive commercials are CG—or even more. In the architecture industry, many traditional techniques to create representations, such as scale models, have been replaced with photorealistic imagery to ensure we can see exactly how something will look once it’s built.&lt;/p&gt;
&lt;p&gt;Accelerating processes, reducing costs and the demand for better quality results have helped technology evolve. Hardware is more powerful than ever and the switch to CG was inevitable.&lt;/p&gt;
&lt;h2&gt;How is a 3D rendered image generated?&lt;/h2&gt;
&lt;p&gt;Two pieces of software, with different characteristics, are used to computer-generate images and animations: render engines and game engines. Render engines use a technique called ray tracing, while game engines use a technique called rasterization—and some engines mix both techniques, but we will talk about that later on.&lt;/p&gt;</content:encoded><h:img src="/_astro/thumbnail.DzZDiYKA.jpg"/><enclosure url="/_astro/thumbnail.DzZDiYKA.jpg"/></item><item><title>FastAPI项目开发与部署</title><link>https://pengwee.wang/blog/fastapi</link><guid isPermaLink="true">https://pengwee.wang/blog/fastapi</guid><description>FastAPI 项目开发与部署笔记，包含模块化设计、路由定义和Docker部署</description><pubDate>Sun, 16 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;hr&gt;
&lt;h1&gt;&lt;strong&gt;FastAPI 项目开发与部署笔记&lt;/strong&gt;&lt;/h1&gt;
&lt;p&gt;本笔记总结了如何使用 FastAPI 构建一个模块化、可扩展的 API 系统，并通过 Docker 和 Docker Compose 实现高效的开发和部署流程。&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;&lt;strong&gt;1. 项目结构设计&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;为了构建一个清晰、易维护的项目，项目采用以下结构：&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;project/
├── main.py          # 主入口文件
├── routers/         # 路由模块
│   ├── __init__.py
│   ├── xiaohongshu/ # 小红书 API 文件夹
│   │   ├── __init__.py
│   │   └── image.py # 小红书图片解析 API
│   └── other_api/   # 其他功能 API 文件夹（未来扩展）
│       ├── __init__.py
│       └── example.py
├── utils/           # 工具模块
│   ├── __init__.py
│   └── parser.py    # 解析工具函数
└── models/          # 数据模型（如果需要）
    ├── __init__.py
    └── example.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;特点&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;每个功能模块独立封装在 &lt;code&gt;routers&lt;/code&gt; 文件夹下的子文件夹中。&lt;/li&gt;
&lt;li&gt;动态加载路由，支持灵活扩展。&lt;/li&gt;
&lt;/ul&gt;
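&lt;p&gt;上面提到的“动态加载路由”可以用标准库&lt;code&gt;pkgutil&lt;/code&gt;与&lt;code&gt;importlib&lt;/code&gt;实现。下面是一个最小示意（假设&lt;code&gt;routers/&lt;/code&gt;下每个子模块都暴露一个名为&lt;code&gt;router&lt;/code&gt;的对象，&lt;code&gt;load_routers&lt;/code&gt;为演示用的假想函数名）：&lt;/p&gt;

```python
import importlib
import pkgutil

def load_routers(package_name):
    # 递归遍历 package_name 包下的所有子模块，收集其中名为 router 的对象
    package = importlib.import_module(package_name)
    routers = []
    for info in pkgutil.walk_packages(package.__path__, prefix=package_name + '.'):
        if info.ispkg:
            continue  # 子包本身跳过，只导入具体模块
        module = importlib.import_module(info.name)
        router = getattr(module, 'router', None)
        if router is not None:
            routers.append(router)
    return routers
```

&lt;p&gt;在&lt;code&gt;main.py&lt;/code&gt;中对返回的每个router调用&lt;code&gt;app.include_router(...)&lt;/code&gt;即可完成注册。&lt;/p&gt;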
&lt;hr&gt;
&lt;h2&gt;&lt;strong&gt;2. FastAPI 核心功能实现&lt;/strong&gt;&lt;/h2&gt;
&lt;h3&gt;&lt;strong&gt;(1) 路由定义&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;使用 &lt;code&gt;APIRouter&lt;/code&gt; 定义模块化路由。例如：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from fastapi import APIRouter
from utils.parser import HongshuParser  # 解析函数位于 utils/parser.py

router = APIRouter(prefix=&quot;/image&quot;, tags=[&quot;Image Parsing&quot;])

@router.get(&quot;/&quot;)
async def parse_image(url: str):
    result = HongshuParser(url)
    return result
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&lt;strong&gt;(2) 自动化文档&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;FastAPI 自动生成交互式文档页面：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Swagger UI: &lt;code&gt;/docs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;ReDoc: &lt;code&gt;/redoc&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;可以通过以下方式定制文档页面：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;修改标题：自定义 HTML 模板。&lt;/li&gt;
&lt;li&gt;添加品牌化元素：如 Logo 和样式。&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;&lt;strong&gt;3. Docker 部署&lt;/strong&gt;&lt;/h2&gt;
&lt;h3&gt;&lt;strong&gt;(1) Dockerfile&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;Dockerfile&lt;/code&gt; 示例：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-dockerfile&quot;&gt;# syntax=docker/dockerfile:1.4
FROM --platform=$BUILDPLATFORM python:3.11

WORKDIR /app
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

COPY requirements.txt /app
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 4725/tcp
CMD [&quot;gunicorn&quot;, &quot;-w&quot;, &quot;4&quot;, &quot;-k&quot;, &quot;uvicorn.workers.UvicornWorker&quot;, &quot;--bind&quot;, &quot;0.0.0.0:4725&quot;, &quot;app:app&quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;(2) Docker Compose&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;通过 &lt;code&gt;docker-compose.yml&lt;/code&gt; 简化多容器管理：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: &apos;3.9&apos;
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - &apos;4725:4725&apos;
    volumes:
      - .:/app
    command: &gt;
      gunicorn -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:4725 app:app
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;优点&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;使用 &lt;code&gt;volumes&lt;/code&gt; 挂载本地代码，实时同步代码更改。&lt;/li&gt;
&lt;li&gt;支持多服务管理（如数据库、缓存等）。&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;&lt;strong&gt;4. 更新代码后的重新运行&lt;/strong&gt;&lt;/h2&gt;
&lt;h3&gt;&lt;strong&gt;(1) 手动更新&lt;/strong&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;停止并删除旧容器：
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker stop &amp;#x3C;docker 容器 ID | docker 容器名&gt;
docker rm &amp;#x3C;docker 容器 ID | docker 容器名&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;重新构建镜像并运行：
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker build -t &amp;#x3C;docker 容器名&gt; .
docker run -d -p 4725:4725 --name &amp;#x3C;docker 容器名&gt; &amp;#x3C;docker 镜像名&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;strong&gt;(2) 使用 Docker Compose&lt;/strong&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;重新构建并启动：
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose up --build -d # -d 表示后台运行
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;如果挂载了本地代码，只需保存代码更改即可自动生效。&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2&gt;&lt;strong&gt;5. 项目拓展&lt;/strong&gt;&lt;/h2&gt;
&lt;h3&gt;&lt;strong&gt;(1) 添加新功能&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;新增功能非常简单，只需在 &lt;code&gt;routers&lt;/code&gt; 文件夹下创建新的子文件夹，并按照以下步骤操作：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;创建模块文件夹。&lt;/li&gt;
&lt;li&gt;定义路由。&lt;/li&gt;
&lt;li&gt;初始化模块。&lt;/li&gt;
&lt;li&gt;测试新功能。&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;strong&gt;(2) 集成外部工具&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;通过依赖注入的方式集成外部工具或服务（如数据库、缓存等）。例如：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from fastapi import FastAPI, Depends

app = FastAPI()

def get_db():
    db = &quot;Database Connection&quot;
    return db

@app.get(&quot;/example-with-db&quot;)
async def example_with_db(db=Depends(get_db)):
    return {&quot;db&quot;: db}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;&lt;strong&gt;项目代码&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;项目代码已上传至 &lt;a href=&quot;https://github.com/Snape-max/api&quot;&gt;Qiumo api&lt;/a&gt;, 部署至 &lt;a href=&quot;https://api.qiumo.fun/&quot;&gt;Qiumo.fun&lt;/a&gt;&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>Time Machine</title><link>https://pengwee.wang/blog/time-machine</link><guid isPermaLink="true">https://pengwee.wang/blog/time-machine</guid><description>Time Machine 歌词</description><pubDate>Mon, 03 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/OIP-C.DvJY6ozT_Z1g9jVR.webp&quot; alt=&quot;machine&quot;&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Everyone want to go back, but time waits for no one.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Staring at stars&lt;/p&gt;
&lt;p&gt;Watching the moon&lt;/p&gt;
&lt;p&gt;Hoping that one day they&apos;ll lead me to you&lt;/p&gt;
&lt;p&gt;Wait every night&lt;/p&gt;
&lt;p&gt;Cause if a star falls&lt;/p&gt;
&lt;p&gt;I&apos;ll wish to go back to the times that I loved&lt;/p&gt;
&lt;p&gt;Why do the stars shine so bright in the sky&lt;/p&gt;
&lt;p&gt;If most of the people are sleeping at night&lt;/p&gt;
&lt;p&gt;Why do we only have one chance at life&lt;/p&gt;
&lt;p&gt;I wish I could go back in time&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Pictures remind me of the things I forget&lt;/p&gt;
&lt;p&gt;But also of all of the things that I&apos;ve lost&lt;/p&gt;
&lt;p&gt;Can&apos;t get them back they won&apos;t fall from above&lt;/p&gt;
&lt;p&gt;So I try to forget all the times that I loved&lt;/p&gt;
&lt;p&gt;Why do we remember beautiful lies&lt;/p&gt;
&lt;p&gt;We end up regretting them most of our lives&lt;/p&gt;
&lt;p&gt;Why do we only have one chance to try&lt;/p&gt;
&lt;p&gt;I wish I could go back in time&lt;/p&gt;
&lt;p&gt;Each time I fall asleep&lt;/p&gt;
&lt;p&gt;I always see you there in my dreams&lt;/p&gt;
&lt;p&gt;It&apos;s like going back in a time machine&lt;/p&gt;
&lt;p&gt;I know when I wake up your time with me will end&lt;/p&gt;
&lt;p&gt;So don&apos;t let me fall asleep&lt;/p&gt;
&lt;p&gt;I don&apos;t wanna meet you there in my dreams&lt;/p&gt;
&lt;p&gt;I know that we&apos;ll never build a time machine&lt;/p&gt;
&lt;p&gt;It&apos;s time for me to try and wake up again&lt;/p&gt;
&lt;p&gt;I fall asleep&lt;/p&gt;
&lt;p&gt;But honestly&lt;/p&gt;
&lt;p&gt;I wanna see you in my dreams&lt;/p&gt;
&lt;p&gt;I&apos;m trying to wake up again&lt;/p&gt;</content:encoded><h:img src="/_astro/OIP-C.DvJY6ozT.jpg"/><enclosure url="/_astro/OIP-C.DvJY6ozT.jpg"/></item><item><title>千与千寻——只存在于梦中的童话故事</title><link>https://pengwee.wang/blog/qian-yu-qian-xun</link><guid isPermaLink="true">https://pengwee.wang/blog/qian-yu-qian-xun</guid><description>千与千寻影评——关于本真、友情和爱情的童话故事</description><pubDate>Sun, 02 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/qyqx.D0Ufl47m_2txG12.webp&quot; alt=&quot;千与千寻&quot;&gt;&lt;/p&gt;
&lt;p&gt;标准的 &lt;code&gt;HE&lt;/code&gt; 故事，故事情节很精彩，主旨也很丰富。但是童话终究是童话。&lt;/p&gt;
&lt;p&gt;比较深刻的一部分是关于无脸男。看电影前刚讨论过关于有的男生结婚前很好，但是结婚后会变坏；有的男生结婚前很坏，但是结婚后会变好。虽然变好变坏难以界定，但是仔细分析来说其变化无非出于内因和外因。无脸男只是一个容器，外界装入什么，就表现出什么。大浴场贪婪，其也就变得贪婪。钱婆婆和千寻善良，在她们的影响下他也不再贪婪。&lt;/p&gt;
&lt;p&gt;这样来说，我们就很需要有非常强大的内心，和刚毅的坚守。像千寻一样，不被外界的贪婪干扰，再具体一点，就是保存本真。&lt;/p&gt;
&lt;p&gt;是的，本真。忘记了自己的名字就忘记了自己是谁，忘记了自己的本真就会被别人控制，困扰。汤婆婆是这样来控制他人的。因此要记得自己的本心。&lt;/p&gt;
&lt;p&gt;但是，世事无常，人生魔幻缤纷，能够保存自己本心的人少之又少，大抵到最后都拜倒在诱惑或者屈膝于生存。这样想来就有一种绝望感。&lt;/p&gt;
&lt;p&gt;不过，宫崎骏像是给出了自己的答案，友情和爱情，千寻拯救无脸男，琥珀川和千寻相互救赎，最后事情都被解决，我们都有美好的未来。&lt;/p&gt;
&lt;p&gt;然而，这仔细想来更让人有点失落，知己难以遇到，真爱更是如此，如果孤身一人看这部电影，初看很温馨，但是回过味来便很让人哭泣。带入千寻，或许当初掉进琥珀川的时候可能就被淹死了，闯进神明世界的时候就独自消失了，进到汤屋的时候就被变成了猪或者煤球。一切的一切都像是巧合，都只是童话。&lt;/p&gt;
&lt;p&gt;这样去想又像是个消极主义者了，是我是消极主义者，还是世界影响着我让我变成了消极主义者？我是我，还是世界造就了我？&lt;/p&gt;
&lt;p&gt;或许应该乐观点，自信点，宏大点。如果不能被理解，可以去理解他人；如果不能被救赎，去努力救赎他人。成为一个理想主义者，观察者和记录者。&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;电影中的名字也挺有特色，汤婆婆喜欢钱却却姓汤，钱婆婆不痴迷钱却姓钱，无脸男有脸却无心；千寻千寻，寻找的既是自己，也是自己爱的人。每个人的名字都很有意义，我要取个什么名字？&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;思考真的可以产生热量，刚刚还非常冷的手和胸膛，现在也火热了起来。&lt;/p&gt;</content:encoded><h:img src="/_astro/qyqx.D0Ufl47m.jpg"/><enclosure url="/_astro/qyqx.D0Ufl47m.jpg"/></item><item><title>空山新雨后</title><link>https://pengwee.wang/blog/kong-shan-xin-yu-hou</link><guid isPermaLink="true">https://pengwee.wang/blog/kong-shan-xin-yu-hou</guid><description>空山新雨后 - 经典古诗词</description><pubDate>Mon, 20 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/image.Feci_j7Q_1GGxbr.webp&quot; alt=&quot;image&quot;&gt;&lt;/p&gt;
&lt;h1&gt;空山新雨后&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;空山新雨后，天气晚来秋&lt;/p&gt;
&lt;p&gt;明月松间照，清泉石上流&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;山峰轻摆尾&lt;/p&gt;
&lt;p&gt;卷下落花随流水&lt;/p&gt;
&lt;p&gt;路过擦拭曾经 用你柔情 换我的眼泪&lt;/p&gt;
&lt;p&gt;当爱恨都败退&lt;/p&gt;
&lt;p&gt;没谢幕的人啊&lt;/p&gt;
&lt;p&gt;井中月 举杯砸碎 佐一场宿醉&lt;/p&gt;
&lt;p&gt;抽签的玫瑰&lt;/p&gt;
&lt;p&gt;作熏香还(hai)能余味&lt;/p&gt;
&lt;p&gt;猜测无解答案 算了满地 也是种浪费&lt;/p&gt;
&lt;p&gt;我才终于明白&lt;/p&gt;
&lt;p&gt;终于明白&lt;/p&gt;
&lt;p&gt;不能被施舍的是爱&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;取下褪漆的钗&lt;/p&gt;
&lt;p&gt;就化作尘埃&lt;/p&gt;
&lt;p&gt;喝多少暖身的酒&lt;/p&gt;
&lt;p&gt;暖不了心口&lt;/p&gt;
&lt;p&gt;待空山新雨后&lt;/p&gt;
&lt;p&gt;放一叶小舟&lt;/p&gt;
&lt;p&gt;载上无人问津的温柔&lt;/p&gt;
&lt;p&gt;摆渡寻处去忘忧&lt;/p&gt;
&lt;p&gt;抽签的玫瑰&lt;/p&gt;
&lt;p&gt;作熏香还能余味&lt;/p&gt;
&lt;p&gt;猜测无解答案 算了满地 也是种浪费&lt;/p&gt;
&lt;p&gt;我才终于明白&lt;/p&gt;
&lt;p&gt;终于明白&lt;/p&gt;
&lt;p&gt;不能被施舍的是爱&lt;/p&gt;
&lt;p&gt;取下褪漆的钗&lt;/p&gt;
&lt;p&gt;就化作尘埃&lt;/p&gt;
&lt;p&gt;喝多少暖身的酒&lt;/p&gt;
&lt;p&gt;暖不了心口&lt;/p&gt;
&lt;p&gt;待空山新雨后&lt;/p&gt;
&lt;p&gt;放一叶小舟&lt;/p&gt;
&lt;p&gt;载上无人问津的温柔&lt;/p&gt;
&lt;p&gt;摆渡寻处去忘忧&lt;/p&gt;</content:encoded><h:img src="/_astro/image.Feci_j7Q.png"/><enclosure url="/_astro/image.Feci_j7Q.png"/></item><item><title>ax650交叉编译ax-pipeline</title><link>https://pengwee.wang/blog/jiao-cha-bian-yi</link><guid isPermaLink="true">https://pengwee.wang/blog/jiao-cha-bian-yi</guid><description>ax650交叉编译ax-pipeline教程</description><pubDate>Wed, 19 Jun 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;ax650交叉编译ax-pipeline&lt;/h1&gt;
&lt;h2&gt;编译前准备&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;x86 Linux&lt;/code&gt;系统，虚拟机或者实体机，推荐选择&lt;code&gt;Ubuntu 22.04&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;稳定网络环境(需要连接&lt;code&gt;github&lt;/code&gt;)，若下载出现问题可参考&lt;a href=&quot;#github%E9%95%9C%E5%83%8F%E5%8A%A0%E9%80%9F%E4%B8%8B%E8%BD%BD&quot;&gt;此处&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;U盘&lt;/li&gt;
&lt;li&gt;安装基础编译包&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo apt update
sudo apt install build-essential libopencv-dev cmake
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;交叉编译&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;拉取ax-pipeline源码及子模块&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone --recursive https://github.com/AXERA-TECH/ax-pipeline.git
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;下载sdk及设置650n_bsp_sdk版本&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd ax-pipeline
./download_ax_bsp.sh ax650
./switch_version_ax650.sh 1.45
cd ax650n_bsp_sdk
wget https://github.com/ZHEQIUSHUI/assets/releases/download/ax650/drm.zip
mkdir third-party
unzip drm.zip -d third-party
cd ..
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;下载opencv&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;mkdir 3rdparty
cd 3rdparty
wget https://github.com/ZHEQIUSHUI/assets/releases/download/ax650/libopencv-4.5.5-aarch64.zip
unzip libopencv-4.5.5-aarch64.zip
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;配置交叉编译器&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;wget https://developer.arm.com/-/media/Files/downloads/gnu-a/9.2-2019.12/binrel/gcc-arm-9.2-2019.12-x86_64-aarch64-none-linux-gnu.tar.xz
tar -xvf gcc-arm-9.2-2019.12-x86_64-aarch64-none-linux-gnu.tar.xz
export PATH=$PATH:$PWD/gcc-arm-9.2-2019.12-x86_64-aarch64-none-linux-gnu/bin/
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;源码编译&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd ax-pipeline
mkdir build
cd build
cmake -DAXERA_TARGET_CHIP=AX650 -DBSP_MSP_DIR=$PWD/../ax650n_bsp_sdk/msp/out -DOpenCV_DIR=$PWD/../3rdparty/libopencv-4.5.5-aarch64/lib/cmake/opencv4 -DSIPY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-none-linux-gnu.toolchain.cmake -DCMAKE_INSTALL_PREFIX=install ..
make -j12
make install
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;获得bin文件如下所示&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;bin
├── config
│   ├── custom_model.json
│   ├── dinov2.json
│   ├── dinov2_depth.json
│   ├── glpdepth.json
│   ├── ppyoloe.json
│   ├── scrfd.json
│   ├── scrfd_recognition.json
│   ├── yolo_nas.json
│   ├── yolov5_seg.json
│   ├── yolov5s.json
│   ├── yolov5s_face.json
│   ├── yolov5s_face_recognition.json
│   ├── yolov6.json
│   ├── yolov7.json
│   ├── yolov7_face.json
│   ├── yolov8.json
│   ├── yolov8_pose.json
│   └── yolox.json
├── sample_demux_ivps_npu_hdmi_vo
├── sample_demux_ivps_npu_rtsp
├── sample_demux_ivps_npu_rtsp_hdmi_vo
├── sample_multi_demux_ivps_npu_hdmi_vo
├── sample_multi_demux_ivps_npu_multi_rtsp
├── sample_multi_demux_ivps_npu_multi_rtsp_hdmi_vo
├── sample_vin_ivps_npu_hdmi_vo
└── sample_vin_ivps_npu_venc_rtsp
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;移动到开发板&lt;/h2&gt;
&lt;p&gt;由于编译后文件较大，因此推荐使用U盘进行数据传输&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;将编译后bin文件移动到U盘中&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;U盘插入板卡中&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;查看U盘所在分区&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/image-20240619004815857.w4CFLluT_z3WNt.webp&quot; alt=&quot;image-20240619004815857&quot;&gt;&lt;/p&gt;
&lt;p&gt;如图所示，我的U盘所在分区为&lt;code&gt;/dev/sda1&lt;/code&gt; (根据大小或者其他来判断)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;挂载到文件夹中(此处挂载到了&lt;code&gt;/mnt/usb&lt;/code&gt;文件夹下)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;mkdir /mnt/usb
mount /dev/sda1 /mnt/usb
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;可能会有以下提示，不影响&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/image-20240619005631945.nonEXb9N_12yeH9.webp&quot; alt=&quot;image-20240619005631945&quot;&gt;&lt;/p&gt;
&lt;p&gt;查看是否挂载&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/image-20240619005701612.DfIX9kq2_4utiM.webp&quot; alt=&quot;image-20240619005701612&quot;&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;移动文件到板卡中(此处创建了&lt;code&gt;~/data目录&lt;/code&gt;，并将文件移动到了&lt;code&gt;~/data/&lt;/code&gt;下)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;mkdir ~/data
cp /mnt/usb/bin ~/data -r
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;查看文件&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/image-20240619005844205.DMYMyDJl_ZbeDir.webp&quot; alt=&quot;image-20240619005844205&quot;&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;运行默认示例，不传入模型参数(记得&lt;code&gt;kill fb_vo&lt;/code&gt;进程)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd ~/data/bin
./sample_vin_ivps_npu_hdmi_vo
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;移除U盘&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;卸载U盘&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;umount /mnt/usb
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;即可拔掉U盘&lt;/p&gt;
&lt;h2&gt;github镜像加速下载&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;git&lt;/code&gt;拉取&lt;code&gt;ax-pipeline&lt;/code&gt;源码加速&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://kkgithub.com/AXERA-TECH/ax-pipeline.git
cd ax-pipeline
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;修改&lt;code&gt;ax-pipeline&lt;/code&gt;下&lt;code&gt;.gitmodules&lt;/code&gt;文件， 将&lt;code&gt;url =&lt;/code&gt;中所有&lt;code&gt;github.com&lt;/code&gt;换为&lt;code&gt;kkgithub.com&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;拉取子模块&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git submodule update --init
./download_ax_bsp.sh ax650
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;code&gt;wget&lt;/code&gt;文件加速&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;替换&lt;code&gt;wget&lt;/code&gt;下载链接中&lt;code&gt;github.com&lt;/code&gt;为&lt;code&gt;kkgithub.com&lt;/code&gt;&lt;/p&gt;</content:encoded><h:img src="/_astro/image-20240619005844205.DMYMyDJl.png"/><enclosure url="/_astro/image-20240619005844205.DMYMyDJl.png"/></item><item><title>侧耳倾听——阅读、爱情与理想</title><link>https://pengwee.wang/blog/ceer-qingting</link><guid isPermaLink="true">https://pengwee.wang/blog/ceer-qingting</guid><description>侧耳倾听影评——关于阅读、爱情与理想</description><pubDate>Fri, 25 Aug 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/ceer.BokHtzWC_Z1g6jQM.webp&quot; alt=&quot;img&quot;&gt;&lt;/p&gt;
&lt;p&gt;我喜欢上了你努力的样子，因此我也变得越加的努力吸引你的注意。&lt;/p&gt;
&lt;p&gt;两个互相振奋的灵魂、一个变得更加优秀的约定、一份纯洁无暇的爱情。&lt;/p&gt;
&lt;p&gt;就让这些种下未来的种子，在约定的时刻我们相见。&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;想变得更加优秀，在未来的某个时间点上和她一起&lt;/p&gt;
&lt;p&gt;便如此这般&lt;/p&gt;</content:encoded><h:img src="/_astro/ceer.BokHtzWC.png"/><enclosure url="/_astro/ceer.BokHtzWC.png"/></item><item><title>萤火之森——终将别离的爱恋</title><link>https://pengwee.wang/blog/ying-huo-zhi-sen</link><guid isPermaLink="true">https://pengwee.wang/blog/ying-huo-zhi-sen</guid><description>萤火之森影评——终将别离的爱恋</description><pubDate>Wed, 23 Aug 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/Yin.DxXKdJu2_ZhqtCl.webp&quot; alt=&quot;img&quot;&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;时光终有一天会将我们分开。但是，即使如此，在那日降临之前，让我们一直在一起吧。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;如果我和爱的人无法触碰，无法拥抱，那么我大概是要发疯的。&lt;/p&gt;
&lt;p&gt;然而如果触碰便意味着别离，那么失去或许是成全。&lt;/p&gt;
&lt;p&gt;不过如同开头所说的那句话，我们终将分离，但是在我们仍未分离的时光里，快乐的生活着吧。&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;若我是萤，面对终将分离的爱情时，我大概率是会离开的吧，不想面对，别离时的悲伤。&lt;/p&gt;
&lt;p&gt;如果终将分离，倒不如让情还未深的时候结束，让痛苦来的更早些短暂些。&lt;/p&gt;
&lt;p&gt;这是现在的我。&lt;/p&gt;</content:encoded><h:img src="/_astro/Yin.DxXKdJu2.jpg"/><enclosure url="/_astro/Yin.DxXKdJu2.jpg"/></item><item><title>使用PyQt5开发应用程序总结</title><link>https://pengwee.wang/blog/pyqt5</link><guid isPermaLink="true">https://pengwee.wang/blog/pyqt5</guid><description>PyQt5 是一个用于创建图形用户界面的 Python 框架</description><pubDate>Fri, 11 Aug 2023 00:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;PyQt5 使用笔记&lt;/h1&gt;
&lt;p&gt;PyQt5 是一个用于创建图形用户界面(GUI)的 Python 框架，基于 Qt 库开发而来。它提供了丰富的工具和组件，使开发者能够轻松地创建各种强大的桌面应用程序。本文将介绍 PyQt5 的基本用法，并提供一些示例代码帮助你入门。&lt;/p&gt;
&lt;h2&gt;安装 PyQt5&lt;/h2&gt;
&lt;p&gt;首先，需要安装 PyQt5 模块。你可以使用 pip 命令来安装：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pip install PyQt5
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;创建一个基本的 PyQt5 窗口&lt;/h2&gt;
&lt;p&gt;在 PyQt5 中，你可以通过两种方法来创建窗口：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;面向对象编程：&lt;/strong&gt; 这种方法涉及创建一个继承自特定窗口类的新类，并在新类中重写需要的方法来配置界面和处理事件。这种方法更加面向对象，可以更好地组织和管理代码。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;直接编写代码：&lt;/strong&gt; 这种方法涉及直接编写代码来创建窗口和组件，然后配置属性和信号槽等。这种方法更加直接，适用于一些简单的界面或快速原型开发。&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;下面分别展示了这两种方法的示例：&lt;/p&gt;
&lt;h3&gt;面向对象编程&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import sys
from PyQt5.QtWidgets import QApplication, QMainWindow, QPushButton

class MyWindow(QMainWindow):
    def __init__(self):
        super().__init__()

        self.setWindowTitle(&quot;My Window&quot;)

        self.button = QPushButton(&quot;Click me&quot;, self)
        self.button.setGeometry(50, 50, 100, 30)
        self.button.clicked.connect(self.on_button_click)

    def on_button_click(self):
        print(&quot;Button clicked&quot;)

app = QApplication(sys.argv)
window = MyWindow()
window.show()
sys.exit(app.exec_())
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;直接编写代码&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import sys
from PyQt5.QtWidgets import QApplication, QMainWindow, QPushButton

app = QApplication(sys.argv)
window = QMainWindow()
window.setWindowTitle(&quot;My Window&quot;)

button = QPushButton(&quot;Click me&quot;, window)
button.setGeometry(50, 50, 100, 30)
button.clicked.connect(lambda: print(&quot;Button clicked&quot;))

window.show()
sys.exit(app.exec_())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;无论你选择哪种方法，都可以根据项目需求来灵活调整和扩展代码。如果界面较为复杂或需要更好的代码组织，建议使用面向对象编程。如果界面简单且直接，可以选择直接编写代码。&lt;/p&gt;
&lt;p&gt;以下是一个使用面向对象编程的简单示例代码，展示了如何创建一个基本的 PyQt5 窗口：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import sys
from PyQt5.QtWidgets import QApplication, QMainWindow

class MyWindow(QMainWindow):
    def __init__(self):
        super().__init__()
        self.setWindowTitle(&quot;My PyQt5 Window&quot;)
        self.setGeometry(100, 100, 800, 600)

if __name__ == &quot;__main__&quot;:
    app = QApplication(sys.argv)
    window = MyWindow()
    window.show()
    sys.exit(app.exec_())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;在这个示例中，我们首先导入了必要的模块，然后创建了一个继承自 &lt;code&gt;QMainWindow&lt;/code&gt; 的自定义窗口类 &lt;code&gt;MyWindow&lt;/code&gt;。在 &lt;code&gt;__init__&lt;/code&gt; 构造函数中，我们设置了窗口的标题和初始大小。最后，我们创建了一个应用对象并显示窗口。&lt;/p&gt;
&lt;h2&gt;常用的 PyQt5 组件&lt;/h2&gt;
&lt;p&gt;当使用 PyQt5 创建图形用户界面时，会涉及多种常用的组件，每个组件都有其特定的属性和用法。以下是一些常用组件的用法：&lt;/p&gt;
&lt;h3&gt;QLabel（标签）&lt;/h3&gt;
&lt;p&gt;标签用于显示文本或图像，可以用来展示信息、标题、说明等。常用属性和方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;setText(text)&lt;/code&gt;：设置标签的文本内容。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;text()&lt;/code&gt;：获取标签的文本内容。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;setPixmap(pixmap)&lt;/code&gt;：设置标签显示的图像。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;setAlignment(alignment)&lt;/code&gt;：设置文本对齐方式。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;setFont(font)&lt;/code&gt;：设置字体。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QLabel
from PyQt5.QtCore import Qt
from PyQt5.QtGui import QFont

label = QLabel(&quot;Hello, PyQt5&quot;)
label.setAlignment(Qt.AlignCenter)
label.setFont(QFont(&quot;Arial&quot;, 12, QFont.Bold))
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QLineEdit（单行文本输入框）&lt;/h3&gt;
&lt;p&gt;单行文本输入框用于接收用户输入的文本，例如用户名、密码等。常用属性和方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;setText(text)&lt;/code&gt;：设置文本框的初始文本。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;text()&lt;/code&gt;：获取用户输入的文本内容。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;setPlaceholderText(text)&lt;/code&gt;：设置提示文本。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QLineEdit

line_edit = QLineEdit()
line_edit.setPlaceholderText(&quot;Enter your name&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QTextEdit（多行文本输入框）&lt;/h3&gt;
&lt;p&gt;多行文本输入框用于接收多行文本输入，支持富文本格式。常用属性和方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;setText(text)&lt;/code&gt;：设置文本框的初始文本。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;toPlainText()&lt;/code&gt;：获取用户输入的纯文本内容。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;insertHtml(html)&lt;/code&gt;：插入富文本内容。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QTextEdit

text_edit = QTextEdit()
text_edit.insertHtml(&quot;&amp;#x3C;b&gt;Hello&amp;#x3C;/b&gt;, &amp;#x3C;i&gt;PyQt5&amp;#x3C;/i&gt;&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QComboBox（下拉框）&lt;/h3&gt;
&lt;p&gt;下拉框提供了一组选项供用户选择。常用属性和方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;addItem(item)&lt;/code&gt;：添加选项。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;addItems(items)&lt;/code&gt;：批量添加选项。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;currentIndex()&lt;/code&gt;：获取当前选中的选项索引。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;currentText()&lt;/code&gt;：获取当前选中的选项文本。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;activated.connect(slot)&lt;/code&gt;：连接选项激活的信号。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QComboBox

def on_combo_box_activated(index):
    print(&quot;Activated:&quot;, index)

combo_box = QComboBox()
combo_box.addItem(&quot;Option 1&quot;)
combo_box.addItems([&quot;Option 2&quot;, &quot;Option 3&quot;])
selected_index = combo_box.currentIndex()
selected_text = combo_box.currentText()
combo_box.activated.connect(on_combo_box_activated)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QPushButton（按钮）&lt;/h3&gt;
&lt;p&gt;按钮用于触发特定操作或事件。常用属性和方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;setText(text)&lt;/code&gt;：设置按钮显示的文本。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;clicked.connect(slot)&lt;/code&gt;：连接按钮点击事件的信号。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QPushButton

def on_button_click():
    print(&quot;Button clicked&quot;)

button = QPushButton(&quot;Click me&quot;)
button.clicked.connect(on_button_click)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QCheckBox（复选框）&lt;/h3&gt;
&lt;p&gt;复选框用于表示选中/未选中两种状态的开关选项。常用属性和方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;isChecked()&lt;/code&gt;：检查复选框是否被选中。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;text()&lt;/code&gt;：获取复选框的文本内容。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;toggled.connect(slot)&lt;/code&gt;：连接复选框状态变化的信号。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QCheckBox

def on_check_box_toggled(checked):
    print(&quot;Toggled:&quot;, checked)

check_box = QCheckBox(&quot;Check me&quot;)
checked = check_box.isChecked()
check_box.toggled.connect(on_check_box_toggled)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QRadioButton（单选按钮）&lt;/h3&gt;
&lt;p&gt;单选按钮用于从多个选项中选择一个。常用属性和方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;isChecked()&lt;/code&gt;：检查单选按钮是否被选中。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;text()&lt;/code&gt;：获取单选按钮的文本内容。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;toggled.connect(slot)&lt;/code&gt;：连接单选按钮状态变化的信号。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QRadioButton

def on_radio_button_toggled(checked):
    print(&quot;Toggled:&quot;, checked)

radio_button = QRadioButton(&quot;Option 1&quot;)
checked = radio_button.isChecked()
radio_button.toggled.connect(on_radio_button_toggled)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QSlider（滑块）&lt;/h3&gt;
&lt;p&gt;滑块用于选择一个范围内的值。常用属性和方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;setRange(minimum, maximum)&lt;/code&gt;：设置滑块的范围。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;setValue(value)&lt;/code&gt;：设置滑块的当前值。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;value()&lt;/code&gt;：获取滑块的当前值。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sliderMoved.connect(slot)&lt;/code&gt;：连接滑块移动事件的信号。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QSlider
from PyQt5.QtCore import Qt

def on_slider_moved(value):
    print(&quot;Slider value:&quot;, value)

slider = QSlider(Qt.Horizontal)
slider.setRange(0, 100)
slider.setValue(50)
slider.sliderMoved.connect(on_slider_moved)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QProgressBar（进度条）&lt;/h3&gt;
&lt;p&gt;进度条用于显示任务的进度。常用属性和方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;setRange(minimum, maximum)&lt;/code&gt;：设置进度条的范围。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;setValue(value)&lt;/code&gt;：设置进度条的当前值。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;value()&lt;/code&gt;：获取进度条的当前值。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QProgressBar

progress_bar = QProgressBar()
progress_bar.setRange(0, 100)
progress_bar.setValue(75)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QSpinBox（数值输入框）&lt;/h3&gt;
&lt;p&gt;数值输入框用于输入整数值。常用属性和方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;setRange(minimum, maximum)&lt;/code&gt;：设置数值输入框的范围。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;setValue(value)&lt;/code&gt;：设置数值输入框的当前值。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;value()&lt;/code&gt;：获取数值输入框的当前值。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QSpinBox

spin_box = QSpinBox()
spin_box.setRange(0, 100)
spin_box.setValue(50)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QDateTimeEdit（日期时间输入框）&lt;/h3&gt;
&lt;p&gt;日期时间输入框用于输入日期和时间。常用属性和方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;setDateTime(datetime)&lt;/code&gt;：设置日期时间输入框的日期时间。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dateTime()&lt;/code&gt;：获取日期时间输入框的日期时间。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QDateTimeEdit
from PyQt5.QtCore import QDateTime

date_time_edit = QDateTimeEdit()
date_time_edit.setDateTime(QDateTime.currentDateTime())
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QFileDialog（文件对话框）&lt;/h3&gt;
&lt;p&gt;文件对话框用于选择文件或目录。常用方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;getOpenFileName()&lt;/code&gt;：打开文件选择对话框并返回选择的文件路径。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;getSaveFileName()&lt;/code&gt;：打开文件保存对话框并返回选择的文件路径。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;getExistingDirectory()&lt;/code&gt;：打开目录选择对话框并返回选择的目录路径。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QFileDialog

file_path, _ = QFileDialog.getOpenFileName(None, &quot;Open File&quot;, &quot;&quot;, &quot;All Files (*.*)&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QMessageBox（消息框）&lt;/h3&gt;
&lt;p&gt;消息框用于显示提示、警告或错误信息。常用方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;information(parent, title, text)&lt;/code&gt;：显示信息提示框。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;warning(parent, title, text)&lt;/code&gt;：显示警告提示框。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;critical(parent, title, text)&lt;/code&gt;：显示错误提示框。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;question(parent, title, text)&lt;/code&gt;：显示询问提示框。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QMessageBox

QMessageBox.information(None, &quot;Info&quot;, &quot;This is an information message.&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;布局管理&lt;/h2&gt;
&lt;p&gt;在 PyQt5 中，布局管理用于自动排列和定位组件，以便适应不同窗口大小。以下是一些常用的布局类型和使用示例：&lt;/p&gt;
&lt;h3&gt;QGridLayout（网格布局）&lt;/h3&gt;
&lt;p&gt;网格布局将组件按照行和列的方式排列。常用方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;addWidget(widget, row, column, rowSpan, columnSpan)&lt;/code&gt;：将组件添加到指定行列位置，可跨行列。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QGridLayout

grid = QGridLayout()
grid.addWidget(label, 0, 0)
grid.addWidget(line_edit, 1, 0)
grid.addWidget(text_edit, 2, 0, 2, 1)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QVBoxLayout（垂直布局）&lt;/h3&gt;
&lt;p&gt;垂直布局将组件按垂直方向排列。常用方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;addWidget(widget)&lt;/code&gt;：将组件按顺序添加到布局。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QVBoxLayout

vbox = QVBoxLayout()
vbox.addWidget(button1)
vbox.addWidget(button2)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;QHBoxLayout（水平布局）&lt;/h3&gt;
&lt;p&gt;水平布局将组件按水平方向排列。常用方法包括：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;addWidget(widget)&lt;/code&gt;：将组件按顺序添加到布局。&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtWidgets import QHBoxLayout

hbox = QHBoxLayout()
hbox.addWidget(button1)
hbox.addWidget(button2)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;这些是一些常用的 PyQt5 组件和布局，通过合理地使用它们，你可以创建出丰富多彩的图形用户界面。根据项目的需求，你可以灵活地选择合适的组件和布局方式。&lt;/p&gt;
&lt;p&gt;布局管理使得窗口中的组件自动适应并排列，无需手动调整位置和大小。&lt;/p&gt;
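&lt;p&gt;上面的布局示例只创建了布局对象本身；要让布局生效，还需要把它附加到某个容器控件上，再把容器放入窗口。下面是笔者补充的一个最小示意（其中 offscreen 平台设置只是为了便于在无显示环境下运行，属于假设性的演示写法）：&lt;/p&gt;

```python
import os
os.environ.setdefault("QT_QPA_PLATFORM", "offscreen")  # 无显示环境下也能创建控件

from PyQt5.QtWidgets import (QApplication, QMainWindow, QWidget,
                             QVBoxLayout, QPushButton)

app = QApplication([])
window = QMainWindow()

container = QWidget()
vbox = QVBoxLayout()
vbox.addWidget(QPushButton("OK"))
vbox.addWidget(QPushButton("Cancel"))
container.setLayout(vbox)           # 将布局附加到容器控件
window.setCentralWidget(container)  # 将容器设为主窗口的中心部件
```

&lt;p&gt;在实际 GUI 程序中，再调用 window.show() 和 app.exec_() 进入事件循环即可显示窗口。&lt;/p&gt;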
&lt;h2&gt;多线程与线程间通信&lt;/h2&gt;
&lt;h3&gt;创建线程&lt;/h3&gt;
&lt;p&gt;在 PyQt5 中，可以使用 &lt;code&gt;QThread&lt;/code&gt; 类来创建线程。为了创建一个自定义线程，需要继承 &lt;code&gt;QThread&lt;/code&gt; 并重写其 &lt;code&gt;run&lt;/code&gt; 方法，将耗时操作放在 &lt;code&gt;run&lt;/code&gt; 方法中执行。&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtCore import QThread

class MyThread(QThread):
    def run(self):
        # 耗时操作
        pass
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;在线程间传递信号&lt;/h3&gt;
&lt;p&gt;在多线程应用中，线程之间的通信是常见的需求。PyQt5 提供了信号与槽机制来实现线程间的通信。可以通过自定义信号，在一个线程中发射信号，然后在另一个线程中连接该信号到槽函数来接收信号。&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtCore import QThread, pyqtSignal

class MyThread(QThread):
    my_signal = pyqtSignal(str)  # 自定义信号，传递参数为 str 类型

    def run(self):
        # 耗时操作
        result = &quot;耗时操作的结果&quot;
        self.my_signal.emit(result)  # 发射信号
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;主线程接收信号&lt;/h3&gt;
&lt;p&gt;主线程可以连接自定义信号的槽函数，以接收在子线程中发射的信号。&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class MainWindow(QMainWindow):
    def __init__(self):
        super().__init__()
        self.thread = MyThread()
        self.init_ui()

    def init_ui(self):
        # ... 初始化界面 ...
        self.thread.my_signal.connect(self.update_label)  # 连接信号和槽函数
        self.thread.start()  # 启动子线程

    def update_label(self, result):
        print(result)  # 在这里更新界面，例如 self.label.setText(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;安全退出子线程&lt;/h3&gt;
&lt;p&gt;为了确保线程的安全退出，可以在窗口关闭事件中停止子线程并等待其完成。&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class MainWindow(QMainWindow):
    # ... 其他代码 ...

    def closeEvent(self, event):
        if self.thread.isRunning():
            self.thread.quit()  # 停止线程
            self.thread.wait()  # 等待线程完成
        event.accept()
&lt;/code&gt;&lt;/pre&gt;
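&lt;p&gt;把上面几段代码合在一起，下面给出一个可独立运行的最小完整示意。为便于脱离图形环境演示，这里用无界面的 QCoreApplication 代替 QMainWindow（这是笔者的简化假设，信号槽的跨线程投递机制与 GUI 程序一致）：&lt;/p&gt;

```python
from PyQt5.QtCore import QCoreApplication, QObject, QThread, pyqtSignal

class Worker(QThread):
    my_signal = pyqtSignal(str)

    def run(self):
        # 模拟耗时操作后发射结果
        self.my_signal.emit("耗时操作的结果")

class Receiver(QObject):
    def __init__(self, app):
        super().__init__()
        self.app = app
        self.results = []

    def update_label(self, result):
        self.results.append(result)  # GUI 程序中在这里更新界面
        self.app.quit()              # 收到结果后退出事件循环

app = QCoreApplication([])
receiver = Receiver(app)
worker = Worker()
worker.my_signal.connect(receiver.update_label)
worker.start()
app.exec_()    # 跨线程信号经事件循环排队投递到主线程
worker.wait()  # 等待子线程结束，安全退出
```

&lt;p&gt;由于 receiver 属于主线程，子线程发射的信号会以排队方式投递，槽函数始终在主线程中执行，这正是更新界面所需要的行为。&lt;/p&gt;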
&lt;h3&gt;进一步解释：&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;self.my_signal.emit(result)&lt;/code&gt;：这行代码在子线程中发射了一个自定义信号 &lt;code&gt;my_signal&lt;/code&gt;，并传递了参数 &lt;code&gt;result&lt;/code&gt;。这个信号可以携带任意数量和类型的参数，这里我们传递了一个字符串 &lt;code&gt;result&lt;/code&gt;。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;self.thread.my_signal.connect(self.update_label)&lt;/code&gt;：这行代码在主线程中连接了子线程发射的信号 &lt;code&gt;my_signal&lt;/code&gt; 到主线程的槽函数 &lt;code&gt;update_label&lt;/code&gt;。这样一旦子线程发射了信号，主线程就会调用 &lt;code&gt;update_label&lt;/code&gt; 方法来处理这个信号。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;def update_label(self, result):&lt;/code&gt;：这是主线程中的槽函数。当子线程发射信号时，主线程会调用这个函数，并将子线程传递的参数 &lt;code&gt;result&lt;/code&gt; 作为参数传递给这个函数。因此，&lt;code&gt;result&lt;/code&gt; 确实代表了子线程传递的 &lt;code&gt;result&lt;/code&gt;。&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;关于&lt;code&gt;def update_label(self, result):&lt;/code&gt; 中的参数名
参数名只是一个标识符，它并不影响信号的传递和槽函数的调用。&lt;/p&gt;
&lt;p&gt;例如，你可以这样修改函数定义：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def update_label(self, data):
    print(data)  # 使用 data 参数进行处理
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;连接信号的写法保持不变：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;self.thread.my_signal.connect(self.update_label)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;只要信号和槽函数的参数类型匹配，无论参数名是什么，信号传递的参数都能够成功传递给槽函数进行处理。&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
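&lt;p&gt;上述第 4 点可以用下面这个可独立运行的小片段验证（笔者补充的假设性示例；在同一线程内，信号默认以直接连接方式同步调用槽函数，因此不需要事件循环）：&lt;/p&gt;

```python
from PyQt5.QtCore import QObject, pyqtSignal

class Emitter(QObject):
    my_signal = pyqtSignal(str)

received = []

def update_label(data):  # 参数名换成 data 同样工作
    received.append(data)

emitter = Emitter()
emitter.my_signal.connect(update_label)
emitter.my_signal.emit("耗时操作的结果")
```

&lt;p&gt;发射后 received 中即包含传入的字符串，说明参数按位置传递，与参数名无关。&lt;/p&gt;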
&lt;h3&gt;关于传递参数的类型&lt;/h3&gt;
&lt;p&gt;在 PyQt5 中，你可以使用自定义信号来传递多种类型的参数。除了 &lt;code&gt;str&lt;/code&gt; 类型，还可以传递以下常用的参数类型：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;int&lt;/code&gt;：整数类型。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;float&lt;/code&gt;：浮点数类型。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bool&lt;/code&gt;：布尔类型。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;list&lt;/code&gt; 或 &lt;code&gt;tuple&lt;/code&gt;：列表或元组类型，可以传递多个参数。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;object&lt;/code&gt;：Python 对象，可以传递任意类型的参数。&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;需要注意的是，信号和槽函数的参数类型必须匹配，否则会引发错误。当然，你也可以使用 &lt;code&gt;pyqtSignal(object)&lt;/code&gt; 来传递任意类型的参数，但在槽函数内部需要根据参数类型进行适当的处理。&lt;/p&gt;
&lt;p&gt;以下是一个示例，展示了如何使用不同类型的参数传递自定义信号：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtCore import pyqtSignal, QObject

class MyObject(QObject):
    my_signal_int = pyqtSignal(int)
    my_signal_float = pyqtSignal(float)
    my_signal_bool = pyqtSignal(bool)
    my_signal_list = pyqtSignal(list)
    my_signal_object = pyqtSignal(object)

    def send_signals(self):
        self.my_signal_int.emit(42)
        self.my_signal_float.emit(3.14)
        self.my_signal_bool.emit(True)
        self.my_signal_list.emit([1, 2, 3])
        self.my_signal_object.emit(&quot;Hello from signal!&quot;)

def my_slot(data):
    print(&quot;Received:&quot;, data)

obj = MyObject()
obj.my_signal_int.connect(my_slot)
obj.my_signal_float.connect(my_slot)
obj.my_signal_bool.connect(my_slot)
obj.my_signal_list.connect(my_slot)
obj.my_signal_object.connect(my_slot)

obj.send_signals()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;在上述示例中，我们定义了一个 &lt;code&gt;MyObject&lt;/code&gt; 类，它包含了不同类型的自定义信号。然后，我们通过连接这些信号到同一个槽函数 &lt;code&gt;my_slot&lt;/code&gt; 来展示如何传递不同类型的参数。在槽函数内部，我们可以根据参数的类型来进行相应的处理。&lt;/p&gt;
&lt;h3&gt;多线程进阶&lt;/h3&gt;
&lt;p&gt;当涉及多线程编程和线程间通信时，以下是一些重要的概念和技术&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;互斥锁和信号量&lt;/strong&gt;：&lt;/p&gt;
&lt;p&gt;互斥锁用于保护共享资源，以确保在任何时候只有一个线程可以访问资源。信号量用于限制同时访问资源的线程数量。&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtCore import QMutex, QSemaphore, QThread

class SharedResource:
    def __init__(self):
        self.mutex = QMutex()  # 创建互斥锁
        self.semaphore = QSemaphore(3)  # 创建信号量，允许3个线程同时访问

    def access_resource(self):
        self.semaphore.acquire()  # 获取信号量
        self.mutex.lock()  # 上锁
        # 访问和操作共享资源
        self.mutex.unlock()  # 解锁
        self.semaphore.release()  # 释放信号量

class WorkerThread(QThread):
    def __init__(self, resource):
        super().__init__()
        self.resource = resource

    def run(self):
        self.resource.access_resource()

resource = SharedResource()
threads = [WorkerThread(resource) for _ in range(5)]

for thread in threads:
    thread.start()
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;线程池&lt;/strong&gt;：&lt;/p&gt;
&lt;p&gt;线程池可以有效地管理和调度多个线程执行任务，避免频繁地创建和销毁线程。&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtCore import QThreadPool, QRunnable, QThread

class Task(QRunnable):
    def __init__(self, task_id):
        super().__init__()
        self.task_id = task_id

    def run(self):
        print(f&quot;Task {self.task_id} is running in thread {int(QThread.currentThreadId())}&quot;)

pool = QThreadPool.globalInstance()

for i in range(5):
    task = Task(i)
    pool.start(task)
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;定时器和延迟&lt;/strong&gt;：&lt;/p&gt;
&lt;p&gt;使用定时器可以在一段时间后触发任务，避免阻塞线程。&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PyQt5.QtCore import QTimer, pyqtSlot

class TimerExample:
    def __init__(self):
        self.timer = QTimer()
        self.timer.timeout.connect(self.on_timer_timeout)
        self.timer.start(1000)  # 每秒触发一次

    @pyqtSlot()
    def on_timer_timeout(self):
        print(&quot;Timer triggered&quot;)

example = TimerExample()
# 注意：QTimer 依赖 Qt 事件循环，需在 QApplication 运行（app.exec_()）期间才会触发
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;线程间通信的其他方式&lt;/strong&gt;：&lt;/p&gt;
&lt;p&gt;除了信号和槽函数，还可以使用队列来在线程之间传递数据。&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import queue
from PyQt5.QtCore import QThread

class QueueExample(QThread):
    def __init__(self):
        super().__init__()
        self.message_queue = queue.Queue()

    def run(self):
        while True:
            message = self.message_queue.get()
            if message == &quot;exit&quot;:
                break
            print(f&quot;Received message: {message}&quot;)

    def send_message(self, message):
        self.message_queue.put(message)

example = QueueExample()
example.start()
example.send_message(&quot;Hello&quot;)
example.send_message(&quot;World&quot;)
example.send_message(&quot;exit&quot;)
example.wait()  # 等待线程处理完消息后安全退出
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;总结&lt;/h2&gt;
&lt;p&gt;本文介绍了如何使用 PyQt5 创建常见的 GUI 组件，包括标签、按钮、文本框、下拉框、复选框、滑块、进度条等，如何使用布局管理来排列这些组件，以及如何通过 QThread 与信号槽机制实现多线程和线程间通信。&lt;/p&gt;
&lt;h2&gt;应用&lt;/h2&gt;
&lt;p&gt;应用以上方法，笔者试着写了一个简易串口调试助手 &lt;a href=&quot;https://github.com/Snape-max/MA-SerialDebugger/&quot;&gt;MA-SerialDebugger&lt;/a&gt;，欢迎使用并提出改进意见。&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>MATLAB中的算数运算命令</title><link>https://pengwee.wang/blog/suan-shu-yun-suan-ming-ling</link><guid isPermaLink="true">https://pengwee.wang/blog/suan-shu-yun-suan-ming-ling</guid><description>MATLAB中的算数运算命令</description><pubDate>Sun, 05 Jun 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://pengwee.wang/_astro/tab1.ldAD-WiB_MCadb.webp&quot; alt=&quot;tab1&quot;&gt;&lt;/p&gt;</content:encoded><h:img src="/_astro/tab1.ldAD-WiB.jpeg"/><enclosure url="/_astro/tab1.ldAD-WiB.jpeg"/></item><item><title>MATLAB中集合操作</title><link>https://pengwee.wang/blog/ji-he-cao-zuo</link><guid isPermaLink="true">https://pengwee.wang/blog/ji-he-cao-zuo</guid><description>MATLAB中集合操作函数</description><pubDate>Fri, 03 Jun 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;函数&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;intersect(A,B)&lt;/code&gt;：求两个数组的交集，返回 A 和 B 共有的值，结果按排序顺序排列。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;intersect(A,B,&apos;rows&apos;)&lt;/code&gt;：将 A 和 B 的每一行视为单个实体，返回两者的公共行，返回矩阵的行按排序顺序排列。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ismember(A,B)&lt;/code&gt;：返回与 A 大小相同的逻辑数组，A 的元素出现在 B 中时对应位置为 1（true），否则为 0（false）。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ismember(A,B,&apos;rows&apos;)&lt;/code&gt;：将每一行视为单个实体，若 A 的某行同时也是 B 的行，对应位置返回 1（true），否则返回 0（false）。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;issorted(A)&lt;/code&gt;：若 A 的元素已按排序顺序排列则返回逻辑 1（true），否则返回 0（false）。A 可以是向量，也可以是 N×1 或 1×N 的字符串数组；当 A 与 sort(A) 的输出相等时认为 A 已排序。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;issorted(A,&apos;rows&apos;)&lt;/code&gt;：若二维矩阵 A 的行已按排序顺序排列则返回逻辑 1（true），否则返回 0（false）；当 A 与 sortrows(A) 的输出相等时认为 A 已排序。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;setdiff(A,B)&lt;/code&gt;：求两个数组的差集，返回在 A 中而不在 B 中的值，结果按排序顺序排列。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;setdiff(A,B,&apos;rows&apos;)&lt;/code&gt;：将每一行视为单个实体，返回在 A 中而不在 B 中的行，返回矩阵的行按排序顺序排列。&lt;code&gt;&apos;rows&apos;&lt;/code&gt; 选项不支持元胞数组。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;setxor(A,B)&lt;/code&gt;：求两个数组的对称差（异或）。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;union(A,B)&lt;/code&gt;：求两个数组的并集。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;unique(A)&lt;/code&gt;：返回数组中不重复的值。&lt;/li&gt;
&lt;/ul&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>MATLAB中函数详解</title><link>https://pengwee.wang/blog/han-shu</link><guid isPermaLink="true">https://pengwee.wang/blog/han-shu</guid><description>MATLAB中函数详解</description><pubDate>Thu, 02 Jun 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;函数定义在单独的文件中，函数名和文件名应该相同。&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;函数语句的语法是：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-matlab&quot;&gt;function [out1,out2, ..., outN] = myfun(in1,in2,in3, ..., inN)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;其中 &lt;code&gt;in1,in2...&lt;/code&gt; 是输入参数，&lt;code&gt;out1,out2...&lt;/code&gt; 是输出参数。&lt;/p&gt;
&lt;p&gt;例如：下面的 mymax 函数接收五个数字作为参数，并返回其中最大的数字。&lt;/p&gt;
&lt;p&gt;建立函数文件，命名为 mymax.m 并输入下面的代码：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-matlab&quot;&gt;function max = mymax(n1, n2, n3, n4, n5)
%This function calculates the maximum of the
% five numbers given as input
max =  n1;
if(n2 &gt; max)
    max = n2;
end
if(n3 &gt; max)
   max = n3;
end
if(n4 &gt; max)
    max = n4;
end
if(n5 &gt; max)
    max = n5;
end
&lt;/code&gt;&lt;/pre&gt;
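&lt;p&gt;定义好 mymax.m 后，即可在命令行窗口直接调用（假设该文件位于当前工作路径下）：&lt;/p&gt;

```matlab
mymax(34, 78, 89, 23, 11)
% ans = 89
```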
&lt;h2&gt;MATLAB匿名函数&lt;/h2&gt;
&lt;p&gt;匿名函数类似于传统编程语言中的内联函数，可以在单条 MATLAB 语句中定义。&lt;/p&gt;
&lt;p&gt;它由一个 MATLAB 表达式以及任意数量的输入和输出参数组成。&lt;/p&gt;
&lt;p&gt;匿名函数可以在 MATLAB 命令行中定义，也可以在函数或脚本中定义。&lt;/p&gt;
&lt;p&gt;通过这种方式，可以创建简单的函数，而不必为它们单独创建文件。&lt;/p&gt;
&lt;p&gt;建立一个匿名函数表达式的语法如下：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-matlab&quot;&gt;f = @(arglist)expression
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;详细例子&lt;/h2&gt;
&lt;p&gt;在这个例子中，我们将编写一个匿名函数 power，它接收两个数字作为输入，返回第一个数字的第二个数字次幂。&lt;/p&gt;
&lt;p&gt;在MATLAB中建立一个脚本文件，并输入下述代码：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-matlab&quot;&gt;power = @(x, n) x.^n;
result1 = power(7, 3)
result2 = power(49, 0.5)
result3 = power(10, -10)
result4 = power (4.5, 1.5)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;运行该文件时，显示结果：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;result1 =
   343
result2 =
     7
result3 =
   1.0000e-10
result4 =
    9.5459
&lt;/code&gt;&lt;/pre&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>MATLAB中矩阵的使用</title><link>https://pengwee.wang/blog/ju-zhen</link><guid isPermaLink="true">https://pengwee.wang/blog/ju-zhen</guid><description>MATLAB中矩阵的使用</description><pubDate>Wed, 01 Jun 2022 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;创建矩阵&lt;/h2&gt;
&lt;p&gt;在MATLAB中创建矩阵有以下规则：&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;矩阵元素必须在 “&lt;strong&gt;[ ]&lt;/strong&gt;” 内；&lt;/li&gt;
&lt;li&gt;矩阵的同行元素之间用空格（或 “&lt;strong&gt;,&lt;/strong&gt;”）隔开；&lt;/li&gt;
&lt;li&gt;矩阵的行与行之间用 “&lt;strong&gt;;&lt;/strong&gt;”（或回车符）隔开；&lt;/li&gt;
&lt;li&gt;矩阵的元素可以是数值、变量、表达式或函数；&lt;/li&gt;
&lt;li&gt;矩阵的尺寸不必预先定义。&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;矩阵索引&lt;/h2&gt;
&lt;p&gt;如果要引用矩阵第 m 行、第 n 列的元素，写法如下：&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-matlab&quot;&gt;mx(m, n);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;索引整列&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-matlab&quot;&gt;a = [ 1 2 3 4 5; 2 3 4 5 6; 3 4 5 6 7; 4 5 6 7 8];
v = a(:,4)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;返回&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;v =
     4
     5
     6
     7
&lt;/code&gt;&lt;/pre&gt;
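&lt;p&gt;冒号索引同样可以同时作用于行和列，从矩阵中提取子矩阵。延续上面的矩阵 a，示意如下：&lt;/p&gt;

```matlab
a = [ 1 2 3 4 5; 2 3 4 5 6; 3 4 5 6 7; 4 5 6 7 8];
sub = a(2:3, 2:4)   % 提取第 2~3 行、第 2~4 列
% sub = [3 4 5; 4 5 6]
```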
&lt;p&gt;矩阵赋值&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>MATLAB中多项式详解</title><link>https://pengwee.wang/blog/duo-xiang-shi</link><guid isPermaLink="true">https://pengwee.wang/blog/duo-xiang-shi</guid><description>MATLAB中多项式详解</description><pubDate>Wed, 01 Jun 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;&lt;em&gt;MATLAB表示多项式为包含由下降幂排列的系数的行向量。&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;计算多项式的值&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;polyval()&lt;/code&gt;函数
&lt;strong&gt;eg:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-matlab&quot;&gt;p = [1 7 0 -5 9];
polyval(p,4)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;polyvalm()&lt;/code&gt;函数用于计算矩阵多项式的值（以方阵作为自变量代入多项式）
&lt;strong&gt;eg:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-matlab&quot;&gt;p = [1 7 0 -5 9];
X = [1 2 -3 4; 2 -5 6 3; 3 1 0 2; 5 -7 3 8];
polyvalm(p, X)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;计算多项式的根&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;roots&lt;/code&gt;函数计算多项式的根。 例如，要计算多项式&lt;code&gt;p&lt;/code&gt;的根，可参考以下语法 -&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-matlab&quot;&gt;p = [1 7 0  -5 9];
r = roots(p)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;poly&lt;/code&gt;函数是&lt;code&gt;roots&lt;/code&gt;函数的逆运算，根据多项式的根返回多项式系数。例如 -&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-matlab&quot;&gt;p = [1 7 0  -5 9];
r = roots(p)
p2 = poly(r)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;MATLAB执行上述代码语句返回以下结果 -&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;Trial&gt;&gt; p = [1 7 0  -5 9];
r = roots(p)
p2 = poly(r)

r =

  -6.8661 + 0.0000i
  -1.4247 + 0.0000i
   0.6454 + 0.7095i
   0.6454 - 0.7095i


p2 =

    1.0000    7.0000    0.0000   -5.0000    9.0000
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;多项式曲线拟合&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;polyfit&lt;/code&gt;函数用来查找一个多项式的系数，它符合最小二乘法中的一组数据。 如果&lt;code&gt;x&lt;/code&gt;和&lt;code&gt;y&lt;/code&gt;包含要拟合到&lt;code&gt;n&lt;/code&gt;度多项式的&lt;code&gt;x&lt;/code&gt;和&lt;code&gt;y&lt;/code&gt;数据的两个向量，则得到通过拟合数据的多项式，参考代码 -&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-matlab&quot;&gt;p = polyfit(x,y,n)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;示例&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;创建脚本文件并键入以下代码 -&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-matlab&quot;&gt;x = [1 2 3 4 5 6]; y = [5.5 43.1 128 290.7 498.4 978.67];  %data
p = polyfit(x,y,4)   %get the polynomial
% Compute the values of the polyfit estimate over a finer range,
% and plot the estimate over the real data values for comparison:
x2 = 1:.1:6;
y2 = polyval(p,x2);
plot(x,y,&apos;o&apos;,x2,y2)
grid on
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;MATLAB执行上述代码语句返回以下结果 -&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;Trial&gt;&gt; x = [1 2 3 4 5 6]; y = [5.5 43.1 128 290.7 498.4 978.67];  %data
p = polyfit(x,y,4)   %get the polynomial
% Compute the values of the polyfit estimate over a finer range,
% and plot the estimate over the real data values for comparison:
x2 = 1:.1:6;
y2 = polyval(p,x2);
plot(x,y,&apos;o&apos;,x2,y2)
grid on

p =

    4.1056  -47.9607  222.2598 -362.7453  191.1250
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;同时还输出一个图形 -&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://www.yiibai.com/uploads/images/201710/0810/631081057_19222.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>MATLAB入门</title><link>https://pengwee.wang/blog/matlab</link><guid isPermaLink="true">https://pengwee.wang/blog/matlab</guid><description>MATLAB入门基础向量和矩阵操作</description><pubDate>Mon, 16 May 2022 00:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;向量&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;列向量&lt;/strong&gt; x = [1 ; 2 ; 3 ; 4 ; 5]
以分号分隔每一列&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;行向量&lt;/strong&gt;x = [1 2 3 4 5]或者[1,2,3,4,5]
以空格或者逗号分隔&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;矩阵&lt;/strong&gt;x = [1 2 3;4 5 6;7 8 9]&lt;/p&gt;
&lt;h1&gt;Matlab运算符&lt;/h1&gt;
&lt;p&gt;| 运算符 |             目的             |
| :----: | :--------------------------: |
|   +    |          加法运算符          |
|   -    |          减法运算符          |
|   *    |        标量和矩阵乘法        |
|   .*   |      数组乘法（逐元素）      |
|   ^    |        标量和矩阵求幂        |
|   .^   |           数组求幂           |
|   \    |           矩阵左除           |
|   /    |           矩阵右除           |
|   .\   |           阵列左除           |
|   ./   |           阵列右除           |
|   :    |     向量生成；子阵列提取     |
|   .    | 小数点；与运算符组合构成数组运算（如 .*、./） |
|  ...   |            续行符            |
|   ,    |      分行符（结果显示）      |
|   ;    | 语句结束；分行符（结果不显示） |
|   %    |            注释符            |
|   &apos;    |         引用和转置符         |
|   .&apos;   |          非共轭转置          |
|   ()   |      下标运算；参数定义      |&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Matlab特殊变量与常量&lt;/strong&gt;
|Name|Meaning|
|:-----:|:----:|
|ans|计算结果的变量名|
|eps|浮点数的相对误差|
|i,j|虚数单位，$i^2 = j^2 = -1$|
|inf|无穷大|
|NaN|不定值|
|pi|圆周率|
&lt;strong&gt;Matlab命令&lt;/strong&gt;
|命令|作用|
|:---:|:---:|
|clc|清除命令窗口|
|clear|从内存中删除变量|
|exist|检查存在的文件或变量|
|global|声明全局变量|
|disp|显示一个数组或字符串的内容|
|fscanf|阅读从文件格式的数据|
|format|控制屏幕显示的格式|
|fprintf|格式化输出屏幕或文件|
|input|显示提示并等待用户输入|
|;|抑制结果在屏幕上显示|&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;运算命令&lt;/strong&gt;
|命令|作用/目的|
|:----:|:----:|
|cat|连接数组|
|find|查找非零元素的索引|
|length|计算元素数量|
|linspace|创建间隔向量|
|logspace|创建对数间隔向量|
|max|返回最大元素|
|min|返回最小元素 |
|prod|计算数组元素的连乘积|
|reshape|重新调整矩阵的行数、列数、维数|
|size|计算数组大小|
|sort|排序每个列|
|sum|每列相加|
|eye|创建一个单位矩阵|
|ones|生成全1矩阵|
|zeros|生成零矩阵|
|cross|计算矩阵交叉乘积|
|dot|计算矩阵点积|
|det|计算数组的行列式|
|inv|计算矩阵的逆|
|pinv|计算矩阵的伪逆|
|rank|计算矩阵的秩|
|rref|将矩阵化成行最简形|
|cell|创建单元数组|
|celldisp|显示单元数组|
|cellplot|显示单元数组的图形表示|
|num2cell|将数值阵列转化为异质阵列|
|deal|匹配输入和输出列表|
|iscell|判断是否为元胞类型|&lt;/p&gt;
&lt;h2&gt;MATLAB绘图命令&lt;/h2&gt;
&lt;p&gt;|   命令    |         作用/目的          |
| :-------: | :------------------------: |
|   axis    |     人工选择坐标轴尺寸     |
|   fplot   |        智能绘图功能        |
|   grid    |         显示网格线         |
|   plot    |          生成XY图          |
|   print   |      打印或绘图到文件      |
|   title   |       把文字置于顶部       |
|  xlabel   |    将文本标签添加到x轴     |
|  ylabel   |    将文本标签添加到y轴     |
|   axes    |         创建轴对象         |
|   close   |       关闭当前的绘图       |
| close all |        关闭所有绘图        |
|  figure   |    打开一个新的图形窗口    |
|   gtext   |  通过鼠标在指定位置放注文  |
|   hold    |        保持当前图形        |
|  legend   |        鼠标放置图例        |
|  refresh  |    重新绘制当前图形窗口    |
|    set    |    指定对象的属性，如轴    |
|  subplot  |      在子窗口中创建图      |
|   text    |        在图上做标记        |
|    bar    |         创建条形图         |
|  loglog   |        创建双对数图        |
|   polar   |       创建极坐标图像       |
| semilogx  | 创建半对数图（对数横坐标） |
| semilogy  | 创建半对数图（对数纵坐标） |
|  stairs   |         创建阶梯图         |
|   stem    |         创建针状图         |&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;数据类型转换函数&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;a2b()&lt;/code&gt; &lt;code&gt;a&lt;/code&gt;是要转换的数据类型，&lt;code&gt;b&lt;/code&gt;是要转化为的类型&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;数据类型确定函数&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;isa()&lt;/code&gt; a是要确定的数据类型&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;运算符&lt;/strong&gt;：&lt;code&gt;~=&lt;/code&gt; 表示不等于&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;操作符&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;+&lt;/code&gt;：加法或一元加号。A + B 将 A 与 B 相加。除非其中一个是标量，否则 A 和 B 必须具有相同的尺寸；标量可以与任意大小的矩阵相加。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&lt;/code&gt;：减法或一元减号。A - B 从 A 中减去 B。除非其中一个是标量，否则 A 和 B 必须具有相同的尺寸；任意大小的矩阵都可以减去一个标量。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;*&lt;/code&gt;：矩阵乘法。A*B 是矩阵 A 和 B 的线性代数乘积。对于非标量的 A 和 B，A 的列数必须等于 B 的行数；标量可以与任意大小的矩阵相乘。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.*&lt;/code&gt;：数组乘法。A.*B 是 A 和 B 的逐元素乘积。除非其中一个是标量，否则 A 和 B 必须具有相同的大小。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/&lt;/code&gt;：斜杠或矩阵右除。B/A 与 B*inv(A) 大致相同。更确切地说，B/A = (A&apos;\B&apos;)&apos;。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;./&lt;/code&gt;：数组右除。A./B 是由元素 A(i,j)/B(i,j) 组成的矩阵。除非其中一个是标量，否则 A 和 B 必须具有相同的大小。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\&lt;/code&gt;：反斜杠或矩阵左除。如果 A 是方阵，A\B 与 inv(A)*B 大致相同（但计算方式不同）。如果 A 是 n×n 矩阵，B 是 n 维列向量（或由若干这样的列组成的矩阵），则 X = A\B 是方程 AX = B 的解；若 A 接近奇异或条件数很差，MATLAB 会给出警告。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.\&lt;/code&gt;：数组左除。A.\B 是由元素 B(i,j)/A(i,j) 组成的矩阵。除非其中一个是标量，否则 A 和 B 必须具有相同的大小。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;^&lt;/code&gt;：矩阵求幂。X^p 表示 X 的 p 次幂。若 p 为整数，通过重复平方计算；若整数为负，先对 X 求逆。对于其他 p 值，计算涉及特征值和特征向量：若 [V,D] = eig(X)，则 X^p = V*D.^p/V。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.^&lt;/code&gt;：数组求幂。A.^B 是 A 的每个元素的对应 B 次幂。除非其中一个是标量，否则 A 和 B 必须具有相同的大小。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&apos;&lt;/code&gt;：矩阵转置。A&apos; 是矩阵 A 的线性代数转置；对复数矩阵，这是复共轭转置。&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.&apos;&lt;/code&gt;：数组转置。A.&apos; 是数组 A 的转置；对复数矩阵，不取共轭。&lt;/li&gt;
&lt;/ul&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item></channel></rss>