[Text Vectorization](文本向量化.md) + [Logistic Regression](逻辑回归.md)
> **Usage**: Create a new Notebook on Kaggle and make sure this competition's dataset is selected as input; paste the code below and run it to produce a submittable file at `./submission.csv`.
```python
import os, re, gc
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from scipy.sparse import hstack
# 1) Load the data (column names follow public notebooks/discussions;
#    if your version differs, check with print(df.columns))
DATA_DIR = "/kaggle/input/jigsaw-agile-community-rules"
train = pd.read_csv(f"{DATA_DIR}/train.csv")
test = pd.read_csv(f"{DATA_DIR}/test.csv")
sample = pd.read_csv(f"{DATA_DIR}/sample_submission.csv")
# Typical columns: row_id, body, rule, subreddit, rule_violation (label, train only)
TEXT_COLS = ["rule", "subreddit", "body"]
LABEL_COL = "rule_violation"
def build_text(row):
    # Concatenate rule, subreddit, and comment body into one string
    parts = []
    if "rule" in row and pd.notnull(row["rule"]):
        parts.append(f"[RULE] {row['rule']}")
    if "subreddit" in row and pd.notnull(row["subreddit"]):
        parts.append(f"[SUB] {row['subreddit']}")
    parts.append(f"[TEXT] {row['body']}")
    return " ".join(parts)
train["text"] = train.apply(build_text, axis=1)
test["text"] = test.apply(build_text, axis=1)
# 2) Hold out a validation set (stratified on the label)
X_train, X_val, y_train, y_val = train_test_split(
    train["text"], train[LABEL_COL], test_size=0.1, random_state=42, stratify=train[LABEL_COL]
)
# 3) TF-IDF features (char + word level; char-level alone is a fast, robust start)
tf_char = TfidfVectorizer(analyzer="char_wb", ngram_range=(3,5), min_df=3, max_features=300000)
Xtr_c = tf_char.fit_transform(X_train)
Xva_c = tf_char.transform(X_val)
Xte_c = tf_char.transform(test["text"])
# (Optional) word-level features
tf_word = TfidfVectorizer(analyzer="word", ngram_range=(1,2), min_df=3, max_features=200000)
Xtr_w = tf_word.fit_transform(X_train)
Xva_w = tf_word.transform(X_val)
Xte_w = tf_word.transform(test["text"])
Xtr = hstack([Xtr_c, Xtr_w]).tocsr()
Xva = hstack([Xva_c, Xva_w]).tocsr()
Xte = hstack([Xte_c, Xte_w]).tocsr()
del Xtr_c, Xva_c, Xte_c, Xtr_w, Xva_w, Xte_w; gc.collect()
# 4) Train a linear classifier
clf = LogisticRegression(
    max_iter=2000,
    n_jobs=-1,
    class_weight="balanced",
    solver="saga"  # stable on high-dimensional sparse input with probability output
)
clf.fit(Xtr, y_train)
# 5) Validate + threshold search (F1)
val_prob = clf.predict_proba(Xva)[:, 1]
auc = roc_auc_score(y_val, val_prob)
best_f1, best_th = -1, 0.5
for th in np.linspace(0.1, 0.9, 33):
    f1 = f1_score(y_val, (val_prob >= th).astype(int))
    if f1 > best_f1:
        best_f1, best_th = f1, th
print(f"AUC={auc:.4f} | Best F1={best_f1:.4f} @ thr={best_th:.2f}")
# 6) Predict and write the submission file (column names & filename matter)
test_prob = clf.predict_proba(Xte)[:, 1]
sub = pd.DataFrame({
    "row_id": test["row_id"],
    "rule_violation": test_prob  # submit probabilities
})
sub.to_csv("submission.csv", index=False)
print(sub.head(), "\nSaved to ./submission.csv")
```
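The script loads `sample_submission.csv` but never uses it; before submitting, it is worth asserting that your output matches the sample's schema. A minimal sketch with synthetic stand-ins for `sample` and `sub` (in the notebook these would be the real frames from the code above):

```python
import pandas as pd

# Synthetic stand-ins; replace with the real `sample` and `sub`.
sample = pd.DataFrame({"row_id": [0, 1, 2], "rule_violation": [0.5, 0.5, 0.5]})
sub = pd.DataFrame({"row_id": [0, 1, 2], "rule_violation": [0.12, 0.93, 0.40]})

# Same columns in the same order, and one prediction per test row.
assert list(sub.columns) == list(sample.columns)
assert len(sub) == len(sample)
# Probabilities must stay inside [0, 1] for a probability-based metric.
assert sub["rule_violation"].between(0.0, 1.0).all()
print("submission schema OK")
```

A mismatch here fails fast in your own notebook instead of as an opaque scoring error after submission.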
> Note: The column choices follow public notebooks (`TEXT_COL="body"`, `LABEL_COL="rule_violation"`; the sample submission uses `row_id` + `rule_violation`). If your downloaded data uses slightly different column names, just substitute your actual ones. ([Kaggle](https://www.kaggle.com/code/gagandeep44489/jigsaw-agile-community-rules-classification-by-g?utm_source=chatgpt.com "Jigsaw-Agile Community Rules Classification BY G - Kaggle"))
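Rather than hard-coding one spelling, a small resolver can pick whichever candidate column actually exists in your copy of the data. A hedged sketch (the candidate lists are guesses, not the competition's guaranteed schema):

```python
import pandas as pd

def resolve_col(df, candidates):
    """Return the first candidate that is an actual column of df, else None."""
    for name in candidates:
        if name in df.columns:
            return name
    return None

# Toy frame standing in for train.csv; your columns may differ.
df = pd.DataFrame({"row_id": [0], "body": ["hi"], "rule_violation": [1]})
text_col = resolve_col(df, ["body", "comment_text", "text"])
label_col = resolve_col(df, ["rule_violation", "target", "label"])
print(text_col, label_col)  # -> body rule_violation
```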
> Also, at submission time Kaggle runs a **10-row sample plus a hidden set of roughly $55k$ rows**; you can roughly estimate total runtime as (small-sample time) $\times 5500$ to avoid timing out. ([Kaggle](https://www.kaggle.com/competitions/jigsaw-agile-community-rules/discussion/591393?utm_source=chatgpt.com "Jigsaw - Agile Community Rules Classification - Kaggle"))
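That ×5500 estimate can be scripted: time one pass over the visible rows and scale by the hidden/visible row ratio. A sketch with a dummy scorer standing in for the real vectorize-and-predict pipeline (the 55,000-row figure is the discussion's estimate, not an official number):

```python
import time

def score_rows(rows):
    # Stand-in for TF-IDF transform + predict_proba on a list of texts.
    return [len(r) % 2 for r in rows]

sample_rows = ["example text"] * 10  # the ~10-row visible test set
t0 = time.perf_counter()
score_rows(sample_rows)
elapsed = time.perf_counter() - t0

HIDDEN_ROWS = 55_000  # per the competition discussion
estimated_total = elapsed * HIDDEN_ROWS / len(sample_rows)
print(f"estimated hidden-set runtime: {estimated_total:.3f}s")
```

In practice, subtract one-off costs (fitting the vectorizers, training the model) before scaling, since those do not grow with the number of scored rows.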
---