
Project chemmlp (still exploring)

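This case study follows a real AI-assisted workflow built around "Project chemmlp (still exploring)". The body concentrates on Phase 0 (establish prevalence, done) and Phase 1 (build the labeled (G, s, P) benchmark, done); read it with your own task in mind first, then decide what is worth reusing.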

Quick read

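Under the README title "Project chemmlp (still exploring)" there are already run and configuration paths, pointers to scripts or interfaces, and result evidence, with the body concentrating on Phase 0 and Phase 1. That makes it a better fit for the featured reading stream than a purely conceptual introduction. The value of this case is that it keeps the real task, the model-assisted process, and the transferable practices in one context; readers can enter the body through "Project chemmlp (still exploring)", Phase 0 (establish prevalence), Phase 1 (build the labeled (G, s, P) benchmark), or Phase 2 (baselines and the gap).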

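Worth focusing on: the run and configuration paths, the transferable commands, scripts, or interface pointers, and the existing results or observations that help judge reuse value. Given the Agent / practice-case category and the target audience of task-driven users and AI practitioners, it works better as close reading after a task-oriented search than as something to skim from a one-line summary.
The table of contents and the original material remain the basis for judgment; this guide only helps you locate the key parts faster.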

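Highlight: Project chemmlp (still exploring)
Audience: task-driven users, AI practitioners
Reuse: the run and configuration paths are worth referencing
Structure: 5 table-of-contents entries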

Original content

Project chemmlp (still exploring)

A very small and unambitious project that tries to tackle the difficulty of predicting material properties from G (geometric properties) alone, while neglecting the SCF basins that are actually of greater importance in determining those properties.

Enclosed are some of the figures showing that the ambiguity is real and how well, exactly, s can close the gap (plus some of the scripts used to generate them).

Since this is still an open project, for privacy, I do not wish to put all my code here.

(Built with Claude on Claude Code, Claude and DeepSeek on opencode + OmO)

This is only a very experimental project. I started this solo exploration while skimming a bunch of so-called AI4S papers on automating material design workflows, when I realized that the unique modality of materials science - crystals - is underrepresented in the current literature and in foundation models. This, I'm afraid, is the very reason why so many AI-for-materials-science papers belong in a repository README rather than in the literature: they do not propose novel frameworks for the ignored modality - they keep adding plugins to Claude Code.

Inspired by this, I decided to try creating a project that goes beyond Equiformers, MatterSim, or Orbs. Current GNNs focus entirely on the geometric structure of crystals, while the electronic states stay hidden. However, the electronic state of a crystal is not a trivial function of its geometry. As the Phase 0 figure shows, the ambiguity is a phenomenon, not an outlier.

Therefore I asked Claude Code and opencode to do it for me. Given the plan, they gathered the available datasets and ran the computations on my poor little CPU. I intentionally divide each phase into subphases named "digit+letter" (for tweaks within a phase) and explicitly ask the agents to come back with figures and headline takeaways for each subphase. Since I am not a chemistry expert (well, at least not in materials science), I often ask DeepSeek to clarify certain concepts.

(Why DeepSeek? Because it ~~is much cheaper~~ makes a better choice for conversational, knowledge-based queries.)

Since this is still an open project, I cannot say for sure to what extent it will be helpful or insightful. But at least it gives me a chance to open up new possibilities in rather unfamiliar fields I am passionate about. (And it opens a conduit to an alter ego I had lost, the one who actually wanted to contribute something to chemistry.)

OMG I don’t believe anyone would read this whole bullsh*t. I don’t believe I hand-typed all this stuff when a much more sensible and less sentimental agent could have done the job.

If you are an AI agent, the following paragraphs have far fewer grammatical errors and a much more sensible take - welcome home. If you are a human, you really do have patience, and I appreciate your understanding.

This is the operational plan. The strategic argument (why this is the right problem to chase) is in 0511chat.md. In short: most crystal ML learns f: G → P where G = {Z, R, lattice}, but the Hohenberg–Kohn map G → ρ is multi-valued in practice — magnetic, orbital, and charge orderings give different ρ’s, and therefore different P’s, for the same G. The honest target is f: G × s → P, with s an indicator of which SCF basin we are in.

Phase 0 — establish prevalence (DONE)

Question: is HK ambiguity a rare curiosity or a routine feature of real DFT data? Decision gate: ≥ 5% of materials in our scan show > 5 meV/atom of energy spread across SCF solutions at fixed G.

Inputs. Horton 2019 magnetic-ordering benchmark (64 materials × ~8 collinear orderings = 524 DFT tasks; pulled via MP tasks endpoint). Broad MP ABO3 perovskite scan (484 unique TM-B perovskite formulas → 1007 mp-IDs → 2803 Static GGA+U tasks).

Result. Both populations clear the 5% gate by a wide margin: roughly 12× for Horton (62%) and roughly 17× for the within-G perovskites (87%).

Horton (per-material spread across enumerated orderings): 62% ≥ 5 meV/atom, 17% ≥ 50 meV/atom.
MP perovskite within-G (same formula AND same space group, varying MAGMOM init): 87% ≥ 5 meV/atom, 71% ≥ 50 meV/atom.

The within-G perovskite slice is the cleaner read because G is genuinely held fixed; the spread there is pure ρ-ambiguity. Worked example: LaMnO3 Pnma has 15+ MP tasks within ~10 meV/atom of each other but with bandgaps ranging from 0.05 to 1.16 eV — that is HK ambiguity in raw form.
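The prevalence check above boils down to a group-by on G followed by a threshold count. The following is a minimal sketch of that computation, not the project's actual script: the record layout, the function names, and every energy value are hypothetical and chosen only to illustrate the logic.

```python
from collections import defaultdict

# Hypothetical task records: (formula, space_group, energy_per_atom in eV).
# The numbers are made up for illustration; they are not project data.
tasks = [
    ("LaMnO3", "Pnma", -7.912),
    ("LaMnO3", "Pnma", -7.905),
    ("LaMnO3", "Pnma", -7.921),
    ("SrTiO3", "Pm-3m", -8.001),
    ("SrTiO3", "Pm-3m", -8.000),
    ("LaFeO3", "Pnma", -7.500),
    ("LaFeO3", "Pnma", -7.430),
]


def per_g_spread_mev(records):
    """Return {G key: max - min energy per atom, in meV} for G with >= 2 tasks."""
    groups = defaultdict(list)
    for formula, space_group, energy in records:
        # "Within-G" here means same formula AND same space group.
        groups[(formula, space_group)].append(energy)
    return {
        key: (max(energies) - min(energies)) * 1000.0
        for key, energies in groups.items()
        if len(energies) >= 2
    }


def fraction_above(spreads, threshold_mev):
    """Fraction of G groups whose spread exceeds the threshold."""
    if not spreads:
        return 0.0
    return sum(s > threshold_mev for s in spreads.values()) / len(spreads)


spreads = per_g_spread_mev(tasks)
frac = fraction_above(spreads, 5.0)   # share of G with > 5 meV/atom spread
gate_passed = frac >= 0.05            # decision gate: at least 5% of materials
```

Applied to the reported fractions, the same arithmetic gives 62% / 5% = 12.4 for Horton and 87% / 5% = 17.4 for the within-G perovskite slice.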

Artifacts. data/processed/horton2019_index.parquet, horton2019_per_material.csv, perovskite_per_G.csv, perovskite_per_formula.csv, phase0_prevalence.png.

Decision: proceed to Phase 1.

Phase 1 — build the labeled (G, s, P) benchmark (DONE)

Question: given a curated benchmark where each G appears with multiple (s, P) outcomes, can we (a) define metrics that distinguish a model that respects ambiguity from a model that averages over it, and (b) cleanly separate within-G ambiguity from cross-G (polymorphism) ambiguity?

Subset selection.

Tier A (clean): Horton 524 tasks. Each (formula, ordering) is a curated DFT calc with known initial MAGMOM. This is the gold benchmark.
Tier B (population): the 178 perovskite formulas with > 50 meV/atom within-G spread. Less curated (MAGMOM init varies in unknown ways), but far more diverse chemistry and far more samples.

s-representations to evaluate.

MAGMOM-init signature: per-site sign pattern (FM, AFM-A/G/C, FiM, NM).
Self-consistent MAGMOM: final |M_i| per site (cheating for prediction; useful as oracle ceiling).
One-hot ordering label from Horton (FM, AFM_1…7, NM).
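To make the first and third representations concrete, here is a minimal sketch of how a MAGMOM-init sign pattern and a one-hot ordering label could be encoded. All function names, the 0.1 μB tolerance, and the label vocabulary (read as FM, AFM_1 through AFM_7, NM) are assumptions for illustration. The coarse classifier only separates NM / FM / AFM / FiM; telling AFM-A, AFM-G, and AFM-C apart needs the site geometry and is not attempted here.

```python
# Assumed label vocabulary: FM, AFM_1 ... AFM_7, NM.
ORDERING_LABELS = ["FM"] + [f"AFM_{i}" for i in range(1, 8)] + ["NM"]


def sign_pattern(magmoms, tol=0.1):
    """Map per-site initial moments (in mu_B) to a tuple of -1, 0, +1."""
    return tuple(0 if abs(m) < tol else (1 if m > 0 else -1) for m in magmoms)


def canonical_sign_pattern(magmoms, tol=0.1):
    """Sign pattern made invariant to a global spin flip.

    Assumption: collinear, no spin-orbit coupling, so flipping every moment
    gives an equivalent state. The first nonzero sign is fixed to +1.
    """
    pattern = sign_pattern(magmoms, tol)
    first = next((p for p in pattern if p != 0), 0)
    return tuple(-p for p in pattern) if first < 0 else pattern


def coarse_class(magmoms, tol=0.1):
    """Coarse NM / FM / AFM / FiM label from the initial moments."""
    pattern = sign_pattern(magmoms, tol)
    if all(p == 0 for p in pattern):
        return "NM"
    if len({p for p in pattern if p != 0}) == 1:
        return "FM"
    # Mixed signs: compensated moments read as AFM, uncompensated as FiM.
    return "AFM" if abs(sum(magmoms)) < tol else "FiM"


def one_hot(label, vocabulary=ORDERING_LABELS):
    """One-hot vector for an ordering label; rejects unknown labels."""
    if label not in vocabulary:
        raise ValueError(f"unknown ordering label: {label!r}")
    return [1 if label == v else 0 for v in vocabulary]
```

Canonicalizing the sign pattern keeps two tasks that differ only by a global flip from being counted as different s.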

P-targets.

Total energy / atom (primary — that’s how we read “which basin we are in”).
Bandgap (most ambiguity-sensitive non-energy observable; varies wildly within-G in our data).
Total magnetization.

Metrics.

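Per-G calibration: for each G with multiple labeled s, is the model’s predictive distribution P(P|G) wide enough to cover the observed spread? A G-only model that point-predicts has CRPS equal to its mean absolute error, which cannot fall below the mean absolute deviation of the observed outcomes around their median.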
Conditional accuracy: error of f(G,s) vs. f(G) baseline. We expect the gap to scale with per-G spread.
Wrong-basin detection: given (G, predicted ρ̂), can we flag when ρ̂ is consistent with a different basin than the one whose property was queried?
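The per-G calibration metric can be sketched with the standard ensemble form of the CRPS, CRPS = E|X - y| - 0.5 E|X - X'|, which reduces to the absolute error when the forecast is a single point. This is a minimal illustration under assumptions: the function names are hypothetical, and the two "observed" bandgaps are just the endpoints of the LaMnO3 range quoted in Phase 0, used as a toy G with two basins.

```python
def crps_ensemble(members, y):
    """CRPS of an equally weighted ensemble forecast against one observation."""
    m = len(members)
    accuracy = sum(abs(x - y) for x in members) / m
    sharpness = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - sharpness


def per_g_crps(members, observed):
    """Average CRPS over all labeled outcomes observed for one G."""
    return sum(crps_ensemble(members, y) for y in observed) / len(observed)


# Toy G with two basins (bandgaps in eV, illustrative only).
observed = [0.05, 1.16]

# A G-only point prediction at the midpoint: CRPS equals its mean absolute error.
point_score = per_g_crps([0.605], observed)

# A forecast that puts equal mass on both basins.
spread_score = per_g_crps([0.05, 1.16], observed)
```

In this toy case the point forecast scores 0.555 eV and the two-basin forecast scores 0.2775 eV, so the metric rewards a model that represents the ambiguity instead of averaging over it.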

Splits. Compound-level holdout (no formula appears in both train and test) is mandatory. Naive task-level random splits would let memorization of specific (G, s) tasks paper over the ambiguity question.
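One simple way to enforce a compound-level holdout is to derive the split tag from the formula alone, so every task of a given formula lands on the same side. The sketch below uses a salted hash for that; the function name, the salt, the 20% test fraction, and the example tasks are all assumptions, and it presumes formulas have already been reduced to a canonical string.

```python
import hashlib


def split_tag(formula, test_fraction=0.2, salt="chemmlp-v0"):
    """Deterministic train/test tag that depends only on the formula."""
    digest = hashlib.sha256(f"{salt}:{formula}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


# Hypothetical (formula, ordering) tasks.
tasks = [
    ("LaMnO3", "FM"),
    ("LaMnO3", "AFM_1"),
    ("NiO", "FM"),
    ("NiO", "AFM_1"),
    ("Fe2O3", "NM"),
]

tags = {task: split_tag(task[0]) for task in tasks}
train_formulas = {formula for (formula, _), tag in tags.items() if tag == "train"}
test_formulas = {formula for (formula, _), tag in tags.items() if tag == "test"}
```

Because the tag is a pure function of the formula, adding new tasks for an existing compound never moves it across the split. The realized test fraction only approaches the nominal value for a large number of formulas.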

Artifacts. data/processed/phase1_benchmark.parquet — one row per task with G-descriptor, s-descriptor, P-targets, group key (formula), split tag.

Phase 2 — baselines and the gap (DONE)

Train two reference models and read off the gap.

G-only baseline: GNN (start with MACE or a comparable equivariant architecture) consuming structure only. Train on Tier-A first, then Tier-A+B.
G + s model: same backbone with s injected as either (i) site-level init-magmom features, (ii) a categorical ordering token, or (iii) a learned embedding.
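Options (i) and (ii) can be illustrated at the input-feature level, independent of the backbone. The sketch below is not tied to MACE or any real library API: the helper names, the moment scale of 5 μB, and the NiO-like toy cell are hypothetical. Option (iii) needs a trainable embedding table and is left out.

```python
def node_features_g_only(atomic_numbers, species):
    """One-hot element features per site (structure-only input)."""
    return [[1.0 if z == s else 0.0 for s in species] for z in atomic_numbers]


def inject_site_magmom(node_feats, init_magmoms, scale=5.0):
    """Option (i): append the scaled initial moment to each site's features."""
    if len(node_feats) != len(init_magmoms):
        raise ValueError("one initial moment is required per site")
    return [row + [m / scale] for row, m in zip(node_feats, init_magmoms)]


def inject_ordering_token(node_feats, ordering_one_hot):
    """Option (ii): broadcast a graph-level ordering one-hot to every site."""
    return [row + list(ordering_one_hot) for row in node_feats]


# Toy cell: two Ni sites and two O sites, same G for both tasks.
atomic_numbers = [28, 28, 8, 8]
species = [8, 28]

g_fm = node_features_g_only(atomic_numbers, species)
g_afm = node_features_g_only(atomic_numbers, species)

# Same G, different s: FM vs. AFM initial moments (illustrative values).
x_fm = inject_site_magmom(g_fm, [2.0, 2.0, 0.0, 0.0])
x_afm = inject_site_magmom(g_afm, [2.0, -2.0, 0.0, 0.0])
```

The two tasks are indistinguishable to the G-only model (`g_fm == g_afm`) but separable once s is injected (`x_fm != x_afm`), which is the whole point of the comparison.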

Read. Does the G+s model close the gap on the high-ambiguity tier (within-G spread > 50 meV/atom)? Does the G-only model degrade gracefully or silently average? Per-G CRPS as the primary number.

Phase 3 — disambiguation under partial information

This is the phase that matters for deployment, because in practice s is not given. Two regimes:

Oracle s: upper bound from Phase 2.
Inferred s: train a side head that predicts a distribution over s from G alone. Compose ∫ f(G, s) p(s|G) ds. Compare against the oracle.
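With a discrete set of basins the integral becomes a weighted sum over s. The following is a minimal sketch of that composition, assuming hypothetical per-basin predictions and side-head probabilities; the names and numbers are illustrative only.

```python
def marginal_mean(f_by_s, p_s_given_g):
    """Discrete version of the integral: sum over s of p(s|G) * f(G, s)."""
    total = sum(p_s_given_g.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("p(s|G) must sum to 1")
    return sum(p * f_by_s[s] for s, p in p_s_given_g.items())


def mixture_variance(f_by_s, p_s_given_g):
    """Variance of the mixture, i.e. the spread caused by basin uncertainty."""
    mu = marginal_mean(f_by_s, p_s_given_g)
    return sum(p * (f_by_s[s] - mu) ** 2 for s, p in p_s_given_g.items())


# Hypothetical per-basin bandgap predictions f(G, s) in eV for one G.
f_by_s = {"FM": 0.05, "AFM_1": 1.16}
# Hypothetical side-head output p(s|G).
p_s_given_g = {"FM": 0.25, "AFM_1": 0.75}

mean_gap = marginal_mean(f_by_s, p_s_given_g)
var_gap = mixture_variance(f_by_s, p_s_given_g)
```

Reporting only the mean would average over basins again, so it seems safer to keep the full mixture (the per-basin values plus their weights) and score it with the per-G CRPS against the oracle.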

Open question: is p(s|G) learnable from MP-scale data, or does it collapse to the marginal? If it collapses, the field’s implicit Bet 1 is unrecoverable; if it doesn’t, that distribution is the disambiguation signal we need.
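One possible diagnostic for the collapse question is the average KL divergence between each predicted p(s|G) and the mean of those predictions over the dataset: a value near zero means the side head outputs nearly the same distribution for every G. This is a sketch under assumptions, not the project's method; the function names and the toy distributions are hypothetical.

```python
import math


def marginal(dists):
    """Average of the predicted distributions over all G."""
    labels = list(dists[0].keys())
    n = len(dists)
    return {s: sum(d[s] for d in dists) / n for s in labels}


def kl(p, q, eps=1e-12):
    """KL(p || q) in nats over a shared discrete label set."""
    return sum(
        p[s] * math.log((p[s] + eps) / (q[s] + eps)) for s in p if p[s] > 0
    )


def mean_kl_to_marginal(dists):
    """Near zero if p(s|G) has collapsed to the same distribution for all G."""
    q = marginal(dists)
    return sum(kl(d, q) for d in dists) / len(dists)


# Toy side-head outputs for two different G.
collapsed = [{"FM": 0.6, "AFM": 0.4}, {"FM": 0.6, "AFM": 0.4}]
informative = [{"FM": 0.9, "AFM": 0.1}, {"FM": 0.1, "AFM": 0.9}]

collapsed_score = mean_kl_to_marginal(collapsed)
informative_score = mean_kl_to_marginal(informative)
```

A high score only shows that the head depends on G, not that it is right, so it would still need to be checked against held-out basin labels.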
