大模型算法常见八股与手撕整理

Author： Arno
发布时间：November 2, 2025
71 views
2 comments
13564 words
Categories：人工智能

Arno今年从UBC毕业了，也是参加了26届的秋招，这里是我准备面试的过程中收集到的面经，在这里做一个记录，温故知新。

八股

基础数学

抛硬币的结果有正反两种可能性，有A、B、C三种特殊硬币，A会100%得到正面，B则100%反面，C会75%得到正面。三种特殊硬币各有1枚，剩下7枚为普通硬币。现在我从中取出一枚，抛10次全是正面，求我手中硬币是特殊硬币A的概率。

解题思路： 使用贝叶斯定理求后验概率。

贝叶斯公式：
$$P(A|E) = \frac{P(E|A) \cdot P(A)}{P(E)}$$

其中：

$P(A)$ = 先验概率 = 选中硬币A的概率 = 0.1
$P(E|A)$ = 似然度 = 硬币A抛10次全正面的概率 = 1.0
$P(E)$ = 全概率 = 任意硬币抛10次全正面的概率 = $\sum P(E|X) \cdot P(X)$
- $P(E|A) \cdot P(A) = 1.0 \times 0.1 = 0.1$
- $P(E|B) \cdot P(B) = 0.0 \times 0.1 = 0$
- $P(E|C) \cdot P(C) = 0.75^{10} \times 0.1 \approx 0.0056$
- $P(E|Normal) \cdot P(Normal) = 0.5^{10} \times 0.7 \approx 0.0007$

因此：$$P(A|E) = \frac{0.1}{0.1 + 0 + 0.0056 + 0.0007} \approx 0.9886$$

# Prior probabilities
p_a = p_b = p_c = 0.1  # probability of selecting special coin A, B, or C
p_normal = 0.7  # probability of selecting a normal coin

# Likelihood: P(10 heads | coin type)
p_10_heads_given_a = 1.0  # coin A always shows heads
p_10_heads_given_b = 0.0  # coin B always shows tails
p_10_heads_given_c = 0.75 ** 10  # coin C shows heads with 75% probability
p_10_heads_given_normal = 0.5 ** 10  # normal coin shows heads with 50% probability

# Total probability: P(10 heads)
p_10_heads = (p_a * p_10_heads_given_a + 
              p_b * p_10_heads_given_b + 
              p_c * p_10_heads_given_c + 
              p_normal * p_10_heads_given_normal)

# Posterior probability: P(coin A | 10 heads) using Bayes' theorem
p_a_given_10_heads = (p_a * p_10_heads_given_a) / p_10_heads

print(f"Probability that the coin is special coin A: {p_a_given_10_heads:.4f}")
# Result: approximately 0.9886

什么是泰勒展开式（Taylor Series）和麦克劳林展开式（Maclaurin Series）？

泰勒展开式： 将函数在某点 $a$ 附近展开为无穷级数。

$$f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \cdots + \frac{f^{(n)}(a)}{n!}(x-a)^n$$

或写成求和形式：
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x-a)^n$$

麦克劳林展开式： 泰勒展开式在 $a=0$ 处的特殊情况。

$$f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \cdots = \sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!}x^n$$

常见函数的麦克劳林展开式：

指数函数：
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{n=0}^{\infty} \frac{x^n}{n!}$$
正弦函数：
$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!}$$
余弦函数：
$$\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{(2n)!}$$
自然对数（$|x| < 1$）：
$$\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n}$$
几何级数（$|x| < 1$）：
$$\frac{1}{1-x} = 1 + x + x^2 + x^3 + \cdots = \sum_{n=0}^{\infty} x^n$$

机器学习

XGBoost

XGBoost原理。

XGBoost (eXtreme Gradient Boosting) 是一种基于梯度提升决策树（GBDT）的优化算法，通过串行地构建多个弱学习器（决策树），每棵树修正前一棵树的残差，最终组合成强学习器。

核心思想：

加法模型（Additive Model）： 最终预测为所有树的加权和
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
其中 $f_k$ 是第 $k$ 棵树，$K$ 是树的总数
目标函数： 损失函数 + 正则化项
$$\text{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
其中正则化项为：
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
- $T$: 叶子节点数量
- $w_j$: 第 $j$ 个叶子节点的权重
- $\gamma$: 叶子节点惩罚系数
- $\lambda$: L2正则化系数
泰勒二阶展开： 对损失函数进行二阶泰勒展开
$$L(y_i, \hat{y}_i^{(t)}) \approx L(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)$$
其中：
- $g_i = \frac{\partial L(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}$ (一阶梯度)
- $h_i = \frac{\partial^2 L(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}$ (二阶梯度)
最优叶子节点权重：
$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
其中 $I_j$ 是第 $j$ 个叶子节点包含的样本集合
最优目标函数值：
$$\text{Obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{(\sum_{i \in I_j} g_i)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

XGBoost vs GBDT的优势：

使用二阶泰勒展开，收敛更快
支持自定义损失函数
内置正则化防止过拟合
支持并行计算（特征维度）
缺失值自动处理
支持列采样（借鉴随机森林）
加权分位数sketch处理大数据

XGBoost子节点分裂逻辑。

分裂增益计算（Gain）：

对于一个叶子节点，分裂后的增益为：
$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

其中：

$G_L = \sum_{i \in I_L} g_i$: 左子节点的一阶梯度和
$G_R = \sum_{i \in I_R} g_i$: 右子节点的一阶梯度和
$H_L = \sum_{i \in I_L} h_i$: 左子节点的二阶梯度和
$H_R = \sum_{i \in I_R} h_i$: 右子节点的二阶梯度和
$\gamma$: 分裂的复杂度惩罚

分裂算法流程：

精确贪心算法（Exact Greedy Algorithm）：
- 遍历所有特征
- 对每个特征，按特征值排序
- 线性扫描找到最优分裂点（最大Gain）
- 时间复杂度：$O(nd \cdot \log n)$，$n$ 为样本数，$d$ 为特征数
近似算法（Approximate Algorithm）：
- 使用分位数（percentile）提出候选分裂点
- 只在候选点中寻找最优分裂
- 适用于大规模数据
加权分位数Sketch：
- 使用二阶梯度 $h_i$ 作为权重
- 定义排序函数：
  $$r_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x < z} h$$
- 找到候选分裂点使得相邻点之间的权重差异小于某个阈值 $\epsilon$

分裂停止条件：

Gain ≤ 0（分裂无增益）
达到最大深度（max_depth）
节点样本数 < min_child_weight
达到最大叶子节点数

缺失值处理：

将缺失值分别分配到左子树和右子树
选择使Gain最大的方向作为缺失值的默认方向
训练和预测时统一使用该默认方向

class XGBoostNode:
    def __init__(self, 
                 X: np.ndarray, 
                 y: np.ndarray, 
                 gradients: np.ndarray, 
                 hessians: np.ndarray,
                 lambda_reg: float = 1.0,
                 gamma: float = 0.0,
                 max_depth: int = 3,
                 min_child_weight: float = 1.0):
        """
        XGBoost tree node
        
        Args:
            X: feature matrix
            y: target values
            gradients: first order gradients
            hessians: second order gradients (Hessians)
            lambda_reg: L2 regularization parameter
            gamma: minimum loss reduction required for split
            max_depth: maximum tree depth
            min_child_weight: minimum sum of hessians in child node
        """
        self.X = X
        self.y = y
        self.gradients = gradients
        self.hessians = hessians
        self.lambda_reg = lambda_reg
        self.gamma = gamma
        self.max_depth = max_depth
        self.min_child_weight = min_child_weight
        
        self.left: Optional[XGBoostNode] = None
        self.right: Optional[XGBoostNode] = None
        self.split_feature: Optional[int] = None
        self.split_value: Optional[float] = None
        self.weight: float = self._calculate_leaf_weight()
    
    def _calculate_leaf_weight(self) -> float:
        """Calculate optimal leaf weight"""
        G = np.sum(self.gradients)
        H = np.sum(self.hessians)
        return -G / (H + self.lambda_reg)
    
    def _calculate_gain(self, G_L: float, H_L: float, G_R: float, H_R: float) -> float:
        """Calculate split gain"""
        gain = 0.5 * (
            (G_L ** 2) / (H_L + self.lambda_reg) +
            (G_R ** 2) / (H_R + self.lambda_reg) -
            ((G_L + G_R) ** 2) / (H_L + H_R + self.lambda_reg)
        ) - self.gamma
        return gain
    
    def find_best_split(self) -> tuple[Optional[int], Optional[float], float]:
        """
        Find the best split feature and value
        
        Returns:
            best_feature: index of best feature to split on
            best_value: best split value
            best_gain: best gain achieved
        """
        best_gain = 0.0
        best_feature = None
        best_value = None
        
        n_samples, n_features = self.X.shape
        
        # Iterate through all features
        for feature_idx in range(n_features):
            feature_values = self.X[:, feature_idx]
            unique_values = np.unique(feature_values)
            
            # Try all possible split points
            for value in unique_values:
                left_mask = feature_values <= value
                right_mask = ~left_mask
                
                # Check minimum child weight constraint
                H_L = np.sum(self.hessians[left_mask])
                H_R = np.sum(self.hessians[right_mask])
                
                if H_L < self.min_child_weight or H_R < self.min_child_weight:
                    continue
                
                # Calculate gain
                G_L = np.sum(self.gradients[left_mask])
                G_R = np.sum(self.gradients[right_mask])
                
                gain = self._calculate_gain(G_L, H_L, G_R, H_R)
                
                # Update best split
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_value = value
        
        return best_feature, best_value, best_gain
    
    def split(self, depth: int = 0) -> None:
        """Recursively split the node"""
        # Stop conditions
        if depth >= self.max_depth or len(self.X) == 0:
            return
        
        # Find best split
        best_feature, best_value, best_gain = self.find_best_split()
        
        # If no valid split found, make this a leaf node
        if best_feature is None or best_gain <= 0:
            return
        
        # Perform split
        self.split_feature = best_feature
        self.split_value = best_value
        
        left_mask = self.X[:, best_feature] <= best_value
        right_mask = ~left_mask
        
        # Create child nodes
        self.left = XGBoostNode(
            self.X[left_mask], self.y[left_mask],
            self.gradients[left_mask], self.hessians[left_mask],
            self.lambda_reg, self.gamma, self.max_depth, self.min_child_weight
        )
        self.right = XGBoostNode(
            self.X[right_mask], self.y[right_mask],
            self.gradients[right_mask], self.hessians[right_mask],
            self.lambda_reg, self.gamma, self.max_depth, self.min_child_weight
        )
        
        # Recursively split children
        self.left.split(depth + 1)
        self.right.split(depth + 1)
    
    def predict(self, x: np.ndarray) -> float:
        """Predict for a single sample"""
        if self.split_feature is None:
            return self.weight
        
        if x[self.split_feature] <= self.split_value:
            return self.left.predict(x)
        else:
            return self.right.predict(x)

# Example usage
def squared_loss_gradient(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Calculate gradients and hessians for squared loss"""
    gradients = y_pred - y_true
    hessians = np.ones_like(y_true)
    return gradients, hessians

# Initialize
X = np.random.randn(100, 5)
y = np.random.randn(100)
y_pred = np.zeros_like(y)

# Calculate gradients
gradients, hessians = squared_loss_gradient(y, y_pred)

# Build tree
root = XGBoostNode(X, y, gradients, hessians, lambda_reg=1.0, gamma=0.1, max_depth=3)
root.split()

超参数优化的方式有哪些？

常见超参数优化方法：

网格搜索（Grid Search）

原理： 穷举所有超参数组合
优点： 简单直观，保证找到搜索空间内的最优解
缺点： 计算成本高，维度灾难
适用场景： 超参数空间较小，计算资源充足

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_

随机搜索（Random Search）

原理： 随机采样超参数组合
优点： 比网格搜索快，对重要参数更有效
缺点： 可能错过最优解
适用场景： 超参数空间大，不确定哪些参数重要

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(5, 20),
    'min_samples_split': randint(2, 20),
    'learning_rate': uniform(0.01, 0.3)
}

random_search = RandomizedSearchCV(
    estimator=XGBClassifier(),
    param_distributions=param_distributions,
    n_iter=100,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)

贝叶斯优化（Bayesian Optimization）
- 原理： 使用贝叶斯定理建立目标函数的概率模型（高斯过程），通过采集函数（Acquisition Function）选择下一个采样点
- 常用采集函数：
- EI (Expected Improvement): $EI(x) = \mathbb{E}[\max(f(x) - f(x^+), 0)]$
- UCB (Upper Confidence Bound): $UCB(x) = \mu(x) + \kappa \sigma(x)$
- PI (Probability of Improvement): $PI(x) = P(f(x) > f(x^+))$
- 优点： 样本效率高，适合昂贵的目标函数
- 缺点： 算法复杂，高维空间效果下降
- 适用场景： 模型训练成本高，超参数维度中等（<20维）
```
from skopt import BayesSearchCV
from skopt.space import Real, Integer

search_spaces = {
    'n_estimators': Integer(100, 500),
    'max_depth': Integer(3, 15),
    'learning_rate': Real(0.01, 0.3, prior='log-uniform'),
    'subsample': Real(0.5, 1.0),
    'colsample_bytree': Real(0.5, 1.0)
}

bayes_search = BayesSearchCV(
    XGBClassifier(),
    search_spaces,
    n_iter=50,
    cv=5,
    n_jobs=-1,
    random_state=42
)
bayes_search.fit(X_train, y_train)
```

TPE（Tree-structured Parzen Estimator）

原理： 使用树结构的Parzen估计器建模 $P(x|y)$ 和 $P(y)$，而非 $P(y|x)$
优点： 处理条件超参数，计算效率高
库： Optuna, Hyperopt
适用场景： 复杂的超参数空间，有条件依赖关系

import optuna
from xgboost import XGBClassifier

def objective(trial: optuna.Trial) -> float:
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 10)
    }
    
    model = XGBClassifier(**params, random_state=42)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
best_params = study.best_params

遗传算法（Genetic Algorithm）
- 原理： 模拟自然选择，通过选择、交叉、变异操作进化超参数
- 优点： 适合复杂非凸优化问题
- 缺点： 需要大量评估次数
- 适用场景： 离散+连续混合空间

Hyperband / ASHA

原理： 早停策略，快速淘汰表现差的配置
优点： 大幅减少计算时间，适合深度学习
适用场景： 训练时间长的模型（神经网络）

from ray import tune
from ray.tune.schedulers import ASHAScheduler

config = {
    'n_estimators': tune.randint(100, 500),
    'max_depth': tune.randint(3, 15),
    'learning_rate': tune.loguniform(0.01, 0.3)
}

scheduler = ASHAScheduler(
    max_t=100,
    grace_period=10,
    reduction_factor=3
)

analysis = tune.run(
    train_model,
    config=config,
    num_samples=100,
    scheduler=scheduler
)

神经架构搜索（NAS）- 针对深度学习
- 方法： 强化学习、进化算法、可微分架构搜索（DARTS）
- 适用场景： 自动设计神经网络架构

总结：

| 方法 | 计算预算 | 超参数维度 | 适用模型 |
|------|---------|-----------|---------|
| Grid Search | 小 | 低 (2-3维) | 传统ML |
| Random Search | 中 | 中 (3-10维) | 传统ML/DL |
| Bayesian Opt | 小-中 | 中 (3-15维) | 传统ML |
| TPE (Optuna) | 中 | 中-高 (5-20维) | 通用 |
| Hyperband/ASHA | 大 | 高 | 深度学习 |

大模型

Transformer的结构组成和功能？
BERT、T5、GPT能解决什么问题？
模型蒸馏有几种方法，大模型一般用哪种，CV一般用哪种？他们有什么各自的优势？
为什么GPT可以并行计算，而RNN不能？
QWEN系列纯文本模型每一代的改进点是什么？
在计算Attention Score的时候，如何对Padding做Mask操作？
MHA的作用是什么？每个头学的知识代表什么？
MOE架构是什么？DS如何平衡每个专家学到的Token数量？
有没有了解过DS的MLA和MTP，解释一下。
Prompt工程，你在实践中有什么Trick？什么时候给fewshot？什么时候逻辑引导？
市面上的大模型分别有什么优势，如何选择合适的大模型？
DPO和PPO的区别？
MCP和FC是什么，你在项目中你是如何使用这两个功能的？
Transformer中的QKV注意力机制中，为什么是除以$sqrt{d_k}$而不是$d_k$？
大模型的发展历程，为什么会火，分别解决了什么问题？
llama和Mistral等大模型的实际实现、优缺点。

工程

数组和链表的区别是什么？
插入和删除选择数组还是链表，为什么？
栈的数据结构是什么？
插入或删除栈顶元素，指针是否变化？
关系型数据库和非关系型数据库的区别和优缺点？

损失函数

手撕

Leetcode

Hot100必刷

最长子串
接雨水
旋转位置编码
多维旋转位置编码

机器学习

KNN
K-Means

Transformer

位置编码

多头自注意力训练

深度学习

CNN卷积核

LSTM

Last modification：November 2, 2025

如果觉得我的文章对你有用，请随意赞赏

2 comments

鍗庣撼鍏徃鍚堜綔寮€鎴锋墍闇€鏉愭枡锛熺數璇濆彿鐮?5587291507 寰俊STS5099
November 20th, 2025 at 05:06 pm

华纳圣淘沙公司开户新手教程
零基础学会（183-8890-9465薇-STS5099）
华纳圣淘沙公司开户
华纳圣淘沙公司开户保姆级教程（183-8890-9465薇-STS5099）
一步步教你开通华纳圣淘沙公司账户（183-8890-9465薇-STS5099）
华纳圣淘沙公司开户分步图解
首次开户必看：（183-8890-9465薇-STS5099）
华纳圣淘沙全攻略
华纳圣淘沙公司开户实操手册（183-8890-9465薇-STS5099）
华纳圣淘沙开户流程视频教程
手把手教学：（183-8890-9465薇-STS5099）
华纳圣淘沙公司开户
华纳圣淘沙公司开户完全指南（183-8890-9465薇-STS5099）

Reply
鍗庣撼鍏徃鍚堜綔寮€鎴锋墍闇€鏉愭枡锛熺數璇濆彿鐮?5587291507 寰俊STS5099
November 7th, 2025 at 03:50 pm

新盛客服电话是多少？（?183-8890-9465—《?薇-STS5099】【
新盛开户专线联系方式？（?183-8890--9465—《?薇-STS5099】【?扣6011643??】
新盛客服开户电话全攻略，让娱乐更顺畅！（?183-8890--9465—《?薇-STS5099】客服开户流程，华纳新盛客服开户流程图（?183-8890--9465—《?薇-STS5099】

Reply

大模型算法常见八股与手撕整理

Arno • 2025 年 11 月 02 日

<p>Arno今年从UBC毕业了，也是参加了26届的秋招，这里是我准备面试的过程中收集到的面经，在这里做一个记录，温故知新。</p><h2>八股</h2><h3>基础数学</h3><p><div class="tip inlineBlock info">抛硬币的结果有正反两种可能性，有A、B、C三种特殊硬币，A会100%得到正面，B则100%反面，C会75%得到正面。三种特殊硬币各有1枚，剩下7枚为普通硬币。现在我从中取出一枚，抛10次全是正面，求我手中硬币是特殊硬币A的概率。</div><p><strong>解题思路：</strong> 使用贝叶斯定理求后验概率。</p><p>贝叶斯公式：<br>$$P(A|E) = \frac{P(E|A) \cdot P(A)}{P(E)}$$</p><p>其中：</p><ul><li>$P(A)$ = 先验概率 = 选中硬币A的概率 = 0.1</li><li>$P(E|A)$ = 似然度 = 硬币A抛10次全正面的概率 = 1.0</li><li><p>$P(E)$ = 全概率 = 任意硬币抛10次全正面的概率 = $\sum P(E|X) \cdot P(X)$</p><ul><li>$P(E|A) \cdot P(A) = 1.0 \times 0.1 = 0.1$</li><li>$P(E|B) \cdot P(B) = 0.0 \times 0.1 = 0$</li><li>$P(E|C) \cdot P(C) = 0.75^{10} \times 0.1 \approx 0.0056$</li><li>$P(E|Normal) \cdot P(Normal) = 0.5^{10} \times 0.7 \approx 0.0007$</li></ul></li></ul><p><div class="tip inlineBlock success">因此：$$P(A|E) = \frac{0.1}{0.1 + 0 + 0.0056 + 0.0007} \approx 0.9886$$</div><pre><code class="lang-python"># Prior probabilities
p_a = p_b = p_c = 0.1  # probability of selecting special coin A, B, or C
p_normal = 0.7  # probability of selecting a normal coin

# Likelihood: P(10 heads | coin type)
p_10_heads_given_a = 1.0  # coin A always shows heads
p_10_heads_given_b = 0.0  # coin B always shows tails
p_10_heads_given_c = 0.75 ** 10  # coin C shows heads with 75% probability
p_10_heads_given_normal = 0.5 ** 10  # normal coin shows heads with 50% probability

# Total probability: P(10 heads)
p_10_heads = (p_a * p_10_heads_given_a + 
              p_b * p_10_heads_given_b + 
              p_c * p_10_heads_given_c + 
              p_normal * p_10_heads_given_normal)

# Posterior probability: P(coin A | 10 heads) using Bayes' theorem
p_a_given_10_heads = (p_a * p_10_heads_given_a) / p_10_heads

print(f&quot;Probability that the coin is special coin A: {p_a_given_10_heads:.4f}&quot;)
# Result: approximately 0.9886</code></pre><p><div class="tip inlineBlock info">什么是泰勒展开式（Taylor Series）和麦克劳林展开式（Maclaurin Series）？</div><p><strong>泰勒展开式：</strong> 将函数在某点 $a$ 附近展开为无穷级数。</p><p>$$f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \cdots + \frac{f^{(n)}(a)}{n!}(x-a)^n$$</p><p>或写成求和形式：<br>$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x-a)^n$$</p><p><strong>麦克劳林展开式：</strong> 泰勒展开式在 $a=0$ 处的特殊情况。</p><p>$$f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \cdots = \sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!}x^n$$</p><p><strong>常见函数的麦克劳林展开式：</strong></p><ol><li><strong>指数函数：</strong><br>$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{n=0}^{\infty} \frac{x^n}{n!}$$</li><li><strong>正弦函数：</strong><br>$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!}$$</li><li><strong>余弦函数：</strong><br>$$\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{(2n)!}$$</li><li><strong>自然对数（$|x| &lt; 1$）：</strong><br>$$\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n}$$</li><li><strong>几何级数（$|x| &lt; 1$）：</strong><br>$$\frac{1}{1-x} = 1 + x + x^2 + x^3 + \cdots = \sum_{n=0}^{\infty} x^n$$</li></ol><h3>机器学习</h3><h4>XGBoost</h4><p><div class="tip inlineBlock info">XGBoost原理。</div><p><strong>XGBoost (eXtreme Gradient Boosting)</strong> 是一种基于梯度提升决策树（GBDT）的优化算法，通过串行地构建多个弱学习器（决策树），每棵树修正前一棵树的残差，最终组合成强学习器。</p><p><strong>核心思想：</strong></p><ol><li><strong>加法模型（Additive Model）：</strong> 最终预测为所有树的加权和<br>$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$<br>其中 $f_k$ 是第 $k$ 棵树，$K$ 是树的总数</li><li><p><strong>目标函数：</strong> 损失函数 + 正则化项<br>$$\text{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$</p><p>其中正则化项为：<br>$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$</p><ul><li>$T$: 叶子节点数量</li><li>$w_j$: 第 $j$ 个叶子节点的权重</li><li>$\gamma$: 叶子节点惩罚系数</li><li>$\lambda$: L2正则化系数</li></ul></li><li><p><strong>泰勒二阶展开：</strong> 对损失函数进行二阶泰勒展开<br>$$L(y_i, \hat{y}_i^{(t)}) \approx L(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)$$</p><p>其中：</p><ul><li>$g_i = \frac{\partial L(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}$ (一阶梯度)</li><li>$h_i = \frac{\partial^2 L(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}$ (二阶梯度)</li></ul></li><li><p><strong>最优叶子节点权重：</strong><br>$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$</p><p>其中 $I_j$ 是第 $j$ 个叶子节点包含的样本集合</p></li><li><strong>最优目标函数值：</strong><br>$$\text{Obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{(\sum_{i \in I_j} g_i)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$</li></ol><p><strong>XGBoost vs GBDT的优势：</strong></p><ul><li>使用二阶泰勒展开，收敛更快</li><li>支持自定义损失函数</li><li>内置正则化防止过拟合</li><li>支持并行计算（特征维度）</li><li>缺失值自动处理</li><li>支持列采样（借鉴随机森林）</li><li>加权分位数sketch处理大数据</li></ul><p><div class="tip inlineBlock info">XGBoost子节点分裂逻辑。</div><p><strong>分裂增益计算（Gain）：</strong></p><p>对于一个叶子节点，分裂后的增益为：<br>$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$</p><p>其中：</p><ul><li>$G_L = \sum_{i \in I_L} g_i$: 左子节点的一阶梯度和</li><li>$G_R = \sum_{i \in I_R} g_i$: 右子节点的一阶梯度和</li><li>$H_L = \sum_{i \in I_L} h_i$: 左子节点的二阶梯度和</li><li>$H_R = \sum_{i \in I_R} h_i$: 右子节点的二阶梯度和</li><li>$\gamma$: 分裂的复杂度惩罚</li></ul><p><strong>分裂算法流程：</strong></p><ol><li><p><strong>精确贪心算法（Exact Greedy Algorithm）：</strong></p><ul><li>遍历所有特征</li><li>对每个特征，按特征值排序</li><li>线性扫描找到最优分裂点（最大Gain）</li><li>时间复杂度：$O(nd \cdot \log n)$，$n$ 为样本数，$d$ 为特征数</li></ul></li><li><p><strong>近似算法（Approximate Algorithm）：</strong></p><ul><li>使用分位数（percentile）提出候选分裂点</li><li>只在候选点中寻找最优分裂</li><li>适用于大规模数据</li></ul></li><li><p><strong>加权分位数Sketch：</strong></p><ul><li>使用二阶梯度 $h_i$ 作为权重</li><li>定义排序函数：<br>$$r_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x &lt; z} h$$</li><li>找到候选分裂点使得相邻点之间的权重差异小于某个阈值 $\epsilon$</li></ul></li></ol><p><strong>分裂停止条件：</strong></p><ul><li>Gain ≤ 0（分裂无增益）</li><li>达到最大深度（max_depth）</li><li>节点样本数 &lt; min_child_weight</li><li>达到最大叶子节点数</li></ul><p><strong>缺失值处理：</strong></p><ul><li>将缺失值分别分配到左子树和右子树</li><li>选择使Gain最大的方向作为缺失值的默认方向</li><li>训练和预测时统一使用该默认方向</li></ul><pre><code class="lang-python">class XGBoostNode:
    def __init__(self, 
                 X: np.ndarray, 
                 y: np.ndarray, 
                 gradients: np.ndarray, 
                 hessians: np.ndarray,
                 lambda_reg: float = 1.0,
                 gamma: float = 0.0,
                 max_depth: int = 3,
                 min_child_weight: float = 1.0):
        &quot;&quot;&quot;
        XGBoost tree node
        
        Args:
            X: feature matrix
            y: target values
            gradients: first order gradients
            hessians: second order gradients (Hessians)
            lambda_reg: L2 regularization parameter
            gamma: minimum loss reduction required for split
            max_depth: maximum tree depth
            min_child_weight: minimum sum of hessians in child node
        &quot;&quot;&quot;
        self.X = X
        self.y = y
        self.gradients = gradients
        self.hessians = hessians
        self.lambda_reg = lambda_reg
        self.gamma = gamma
        self.max_depth = max_depth
        self.min_child_weight = min_child_weight
        
        self.left: Optional[XGBoostNode] = None
        self.right: Optional[XGBoostNode] = None
        self.split_feature: Optional[int] = None
        self.split_value: Optional[float] = None
        self.weight: float = self._calculate_leaf_weight()
    
    def _calculate_leaf_weight(self) -&gt; float:
        &quot;&quot;&quot;Calculate optimal leaf weight&quot;&quot;&quot;
        G = np.sum(self.gradients)
        H = np.sum(self.hessians)
        return -G / (H + self.lambda_reg)
    
    def _calculate_gain(self, G_L: float, H_L: float, G_R: float, H_R: float) -&gt; float:
        &quot;&quot;&quot;Calculate split gain&quot;&quot;&quot;
        gain = 0.5 * (
            (G_L ** 2) / (H_L + self.lambda_reg) +
            (G_R ** 2) / (H_R + self.lambda_reg) -
            ((G_L + G_R) ** 2) / (H_L + H_R + self.lambda_reg)
        ) - self.gamma
        return gain
    
    def find_best_split(self) -&gt; tuple[Optional[int], Optional[float], float]:
        &quot;&quot;&quot;
        Find the best split feature and value
        
        Returns:
            best_feature: index of best feature to split on
            best_value: best split value
            best_gain: best gain achieved
        &quot;&quot;&quot;
        best_gain = 0.0
        best_feature = None
        best_value = None
        
        n_samples, n_features = self.X.shape
        
        # Iterate through all features
        for feature_idx in range(n_features):
            feature_values = self.X[:, feature_idx]
            unique_values = np.unique(feature_values)
            
            # Try all possible split points
            for value in unique_values:
                left_mask = feature_values &lt;= value
                right_mask = ~left_mask
                
                # Check minimum child weight constraint
                H_L = np.sum(self.hessians[left_mask])
                H_R = np.sum(self.hessians[right_mask])
                
                if H_L &lt; self.min_child_weight or H_R &lt; self.min_child_weight:
                    continue
                
                # Calculate gain
                G_L = np.sum(self.gradients[left_mask])
                G_R = np.sum(self.gradients[right_mask])
                
                gain = self._calculate_gain(G_L, H_L, G_R, H_R)
                
                # Update best split
                if gain &gt; best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_value = value
        
        return best_feature, best_value, best_gain
    
    def split(self, depth: int = 0) -&gt; None:
        &quot;&quot;&quot;Recursively split the node&quot;&quot;&quot;
        # Stop conditions
        if depth &gt;= self.max_depth or len(self.X) == 0:
            return
        
        # Find best split
        best_feature, best_value, best_gain = self.find_best_split()
        
        # If no valid split found, make this a leaf node
        if best_feature is None or best_gain &lt;= 0:
            return
        
        # Perform split
        self.split_feature = best_feature
        self.split_value = best_value
        
        left_mask = self.X[:, best_feature] &lt;= best_value
        right_mask = ~left_mask
        
        # Create child nodes
        self.left = XGBoostNode(
            self.X[left_mask], self.y[left_mask],
            self.gradients[left_mask], self.hessians[left_mask],
            self.lambda_reg, self.gamma, self.max_depth, self.min_child_weight
        )
        self.right = XGBoostNode(
            self.X[right_mask], self.y[right_mask],
            self.gradients[right_mask], self.hessians[right_mask],
            self.lambda_reg, self.gamma, self.max_depth, self.min_child_weight
        )
        
        # Recursively split children
        self.left.split(depth + 1)
        self.right.split(depth + 1)
    
    def predict(self, x: np.ndarray) -&gt; float:
        &quot;&quot;&quot;Predict for a single sample&quot;&quot;&quot;
        if self.split_feature is None:
            return self.weight
        
        if x[self.split_feature] &lt;= self.split_value:
            return self.left.predict(x)
        else:
            return self.right.predict(x)

# Example usage
def squared_loss_gradient(y_true: np.ndarray, y_pred: np.ndarray) -&gt; tuple[np.ndarray, np.ndarray]:
    &quot;&quot;&quot;Calculate gradients and hessians for squared loss&quot;&quot;&quot;
    gradients = y_pred - y_true
    hessians = np.ones_like(y_true)
    return gradients, hessians

# Initialize
X = np.random.randn(100, 5)
y = np.random.randn(100)
y_pred = np.zeros_like(y)

# Calculate gradients
gradients, hessians = squared_loss_gradient(y, y_pred)

# Build tree
root = XGBoostNode(X, y, gradients, hessians, lambda_reg=1.0, gamma=0.1, max_depth=3)
root.split()</code></pre><p><div class="tip inlineBlock info">超参数优化的方式有哪些？</div><p><strong>常见超参数优化方法：</strong></p><ol><li><p>网格搜索（Grid Search）</p><ul><li><strong>原理：</strong> 穷举所有超参数组合</li><li><strong>优点：</strong> 简单直观，保证找到搜索空间内的最优解</li><li><strong>缺点：</strong> 计算成本高，维度灾难</li><li><strong>适用场景：</strong> 超参数空间较小，计算资源充足</li></ul><pre><code class="lang-python">from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_</code></pre></li><li><p>随机搜索（Random Search）</p><ul><li><strong>原理：</strong> 随机采样超参数组合</li><li><strong>优点：</strong> 比网格搜索快，对重要参数更有效</li><li><strong>缺点：</strong> 可能错过最优解</li><li><strong>适用场景：</strong> 超参数空间大，不确定哪些参数重要</li></ul><pre><code class="lang-python">from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(5, 20),
    'min_samples_split': randint(2, 20),
    'learning_rate': uniform(0.01, 0.3)
}

random_search = RandomizedSearchCV(
    estimator=XGBClassifier(),
    param_distributions=param_distributions,
    n_iter=100,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)</code></pre></li><li><p>贝叶斯优化（Bayesian Optimization）</p><ul><li><strong>原理：</strong> 使用贝叶斯定理建立目标函数的概率模型（高斯过程），通过采集函数（Acquisition Function）选择下一个采样点</li><li><strong>常用采集函数：</strong></li><li>EI (Expected Improvement): $EI(x) = \mathbb{E}[\max(f(x) - f(x^+), 0)]$</li><li>UCB (Upper Confidence Bound): $UCB(x) = \mu(x) + \kappa \sigma(x)$</li><li>PI (Probability of Improvement): $PI(x) = P(f(x) &gt; f(x^+))$</li><li><strong>优点：</strong> 样本效率高，适合昂贵的目标函数</li><li><strong>缺点：</strong> 算法复杂，高维空间效果下降</li><li><strong>适用场景：</strong> 模型训练成本高，超参数维度中等（&lt;20维）</li></ul><pre><code class="lang-python">from skopt import BayesSearchCV
from skopt.space import Real, Integer

search_spaces = {
    'n_estimators': Integer(100, 500),
    'max_depth': Integer(3, 15),
    'learning_rate': Real(0.01, 0.3, prior='log-uniform'),
    'subsample': Real(0.5, 1.0),
    'colsample_bytree': Real(0.5, 1.0)
}

bayes_search = BayesSearchCV(
    XGBClassifier(),
    search_spaces,
    n_iter=50,
    cv=5,
    n_jobs=-1,
    random_state=42
)
bayes_search.fit(X_train, y_train)</code></pre></li><li><p>TPE（Tree-structured Parzen Estimator）</p><ul><li><strong>原理：</strong> 使用树结构的Parzen估计器建模 $P(x|y)$ 和 $P(y)$，而非 $P(y|x)$</li><li><strong>优点：</strong> 处理条件超参数，计算效率高</li><li><strong>库：</strong> Optuna, Hyperopt</li><li><strong>适用场景：</strong> 复杂的超参数空间，有条件依赖关系</li></ul><pre><code class="lang-python">import optuna
from xgboost import XGBClassifier

def objective(trial: optuna.Trial) -&gt; float:
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 10)
    }
    
    model = XGBClassifier(**params, random_state=42)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
best_params = study.best_params</code></pre></li><li><p>遗传算法（Genetic Algorithm）</p><ul><li><strong>原理：</strong> 模拟自然选择，通过选择、交叉、变异操作进化超参数</li><li><strong>优点：</strong> 适合复杂非凸优化问题</li><li><strong>缺点：</strong> 需要大量评估次数</li><li><strong>适用场景：</strong> 离散+连续混合空间</li></ul></li><li><p>Hyperband / ASHA</p><ul><li><strong>原理：</strong> 早停策略，快速淘汰表现差的配置</li><li><strong>优点：</strong> 大幅减少计算时间，适合深度学习</li><li><strong>适用场景：</strong> 训练时间长的模型（神经网络）</li></ul><pre><code class="lang-python">from ray import tune
from ray.tune.schedulers import ASHAScheduler

config = {
    'n_estimators': tune.randint(100, 500),
    'max_depth': tune.randint(3, 15),
    'learning_rate': tune.loguniform(0.01, 0.3)
}

scheduler = ASHAScheduler(
    max_t=100,
    grace_period=10,
    reduction_factor=3
)

analysis = tune.run(
    train_model,
    config=config,
    num_samples=100,
    scheduler=scheduler
)</code></pre></li><li><p>神经架构搜索（NAS）- 针对深度学习</p><ul><li><strong>方法：</strong> 强化学习、进化算法、可微分架构搜索（DARTS）</li><li><strong>适用场景：</strong> 自动设计神经网络架构</li></ul></li></ol><p><div class="tip inlineBlock success">总结：</div><pre><code>| 方法 | 计算预算 | 超参数维度 | 适用模型 |
|------|---------|-----------|---------|
| Grid Search | 小 | 低 (2-3维) | 传统ML |
| Random Search | 中 | 中 (3-10维) | 传统ML/DL |
| Bayesian Opt | 小-中 | 中 (3-15维) | 传统ML |
| TPE (Optuna) | 中 | 中-高 (5-20维) | 通用 |
| Hyperband/ASHA | 大 | 高 | 深度学习 |
</code></pre><h3>大模型</h3><ol><li>Transformer的结构组成和功能？</li><li>BERT、T5、GPT能解决什么问题？</li><li>模型蒸馏有几种方法，大模型一般用哪种，CV一般用哪种？他们有什么各自的优势？</li><li>为什么GPT可以并行计算，而RNN不能？</li><li>QWEN系列纯文本模型每一代的改进点是什么？</li><li>在计算Attention Score的时候，如何对Padding做Mask操作？</li><li>MHA的作用是什么？每个头学的知识代表什么？</li><li>MOE架构是什么？DS如何平衡每个专家学到的Token数量？</li><li>有没有了解过DS的MLA和MTP，解释一下。</li><li>Prompt工程，你在实践中有什么Trick？什么时候给fewshot？什么时候逻辑引导？</li><li>市面上的大模型分别有什么优势，如何选择合适的大模型？</li><li>DPO和PPO的区别？</li><li>MCP和FC是什么，你在项目中你是如何使用这两个功能的？</li><li>Transformer中的QKV注意力机制中，为什么是除以$sqrt{d_k}$而不是$d_k$？</li><li>大模型的发展历程，为什么会火，分别解决了什么问题？</li><li>llama和Mistral等大模型的实际实现、优缺点。</li></ol><h3>工程</h3><ol><li>数组和链表的区别是什么？</li><li>插入和删除选择数组还是链表，为什么？</li><li>栈的数据结构是什么？</li><li>插入或删除栈顶元素，指针是否变化？</li><li>关系型数据库和非关系型数据库的区别和优缺点？</li></ol><h4>损失函数</h4><h2>手撕</h2><h3>Leetcode</h3><p>Hot100必刷</p><ol><li>最长子串</li><li>接雨水</li><li>旋转位置编码</li><li>多维旋转位置编码</li></ol><h3>机器学习</h3><ol><li>KNN</li><li>K-Means</li></ol><h3>Transformer</h3><h4>位置编码</h4><h4>多头自注意力训练</h4><h3>深度学习</h3><h4>CNN卷积核</h4><h4>LSTM</h4>

大模型算法常见八股与手撕整理

八股

基础数学

机器学习

XGBoost

大模型

工程

损失函数

手撕

Leetcode

机器学习

Transformer

位置编码

多头自注意力训练

深度学习

CNN卷积核

LSTM

2 comments

Leave a Comment Cancel reply
使用cookie技术保留您的个人信息以便您下次快速评论，继续评论表示您已同意该条款

Clash for Windows 手动安装 Service Mode

此内容被密码保护

OpenWrt 修复篇

Hello World!

27岁的一次流水账

27岁的一次流水账

创建 Jupyter Book 文档站

此内容被密码保护

Coronary Heart Disease Study

大模型算法常见八股与手撕整理

大模型算法常见八股与手撕整理

八股

基础数学

机器学习

XGBoost

大模型

工程

损失函数

手撕

Leetcode

机器学习

Transformer

位置编码

多头自注意力训练

深度学习

CNN卷积核

LSTM

2 comments

Leave a Comment Cancel reply 使用cookie技术保留您的个人信息以便您下次快速评论，继续评论表示您已同意该条款

大模型算法常见八股与手撕整理

Leave a Comment Cancel reply
使用cookie技术保留您的个人信息以便您下次快速评论，继续评论表示您已同意该条款