CASM Tutorial
- Cluster expansion (CE) model: https://pubs.acs.org/doi/10.1021/acs.chemmater.0c02695
- A CE model is a lattice Hamiltonian used to compute the formation energy of any configuration (occupation) at a much smaller computational cost than DFT
- Goal: fit a CE model from DFT data to obtain thermodynamic knowledge of Na3-xSb1-xWxS4
- Quantities to fit: effective cluster interactions (ECIs), the coefficients of the different clusters
- Code used for fitting: CASM, developed by Prof. Anton van der Ven (UCSB)
Installation of CASM 0.3.x on Orion
- Our experience shows that CASM 0.3.x works best, although a newer version (1.0) is available
- This is the easiest way to install CASM from source: first install the official CASM package with Anaconda from their repository, which pulls in all the dependencies (these are very hard to install manually); then uninstall the repository CASM and reinstall our own version of CASM locally.
Clone and install CASM 0.3.x
git clone https://github.com/prisms-center/CASMcode.git
cd CASMcode
git checkout be1c21b
cd ..
Create conda environment and install all packages
conda create -n casm_0.3 --override-channels -c bpuchala/label/dev -c prisms-center -c defaults -c conda-forge casm=0.3.dev269+gd07b42=condagcc_0 casm-boost=1.66.0=condagcc_0 casm-cpp=0.3.dev269+gd07b42=condagcc_0 casm-python=0.3.dev269+gd07b42=0 scikit-learn=0.21.2=py36hd81dba3_0 bokeh=1.2.0=py36_0 python=3.6.8=h0371630_0
Enable CASM environment
conda activate casm_0.3
Remove the installed CASM
conda remove casm casm-python casm-cpp --force
Compile and install CASM from source
- Make sure your environment has GCC installed; gcc 7.4.0 is known to work.
cd CASMcode
bash build_install.sh
CE Model Construction
- Create a prim.json file. This file contains structural information about the primitive cell (we usually use the experimental cell). Then initialize the project:
casm init
Data structure:
basis:
  coordinate -> coordinate of each site
  occupant_dof -> e.g. ["Na","Va"]; "Va" stands for vacancy
  …
coordinate_mode -> "Cartesian" or "Fractional"
description
lattice_vectors
title
Example prim.json (here from a NASICON project):
{
"basis" : [
{
"coordinate" : [ 0.500000, 0.500000, 0.500000],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.000000, 0.000000, 0.000000],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.889670, 0.610330, 0.250000],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.610330, 0.250000, 0.889670],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.250000, 0.889670, 0.610330],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.389670, 0.750000, 0.110330],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.750000, 0.110330, 0.389670],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.110330, 0.389670, 0.750000],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.352810, 0.352810, 0.352810],
"occupant_dof" : ["Zr"]
}
],
"coordinate_mode" : "Fractional",
"description" : "Si-based NASICON ",
"lattice_vectors" : [
[4.593150 ,2.651856 , 7.393667],
[-4.593150, 2.651856 , 7.393667],
[-0.000000, -5.303713, 7.393667]
],
"title" : "NASICON_prim"
}
- Create composition axes. This defines the composition variables used for the phase diagram; two coupled axes are used for 2D cases (see "useful emails" for why the two coupled axes are needed). Edit .casm/composition_axes.json:
{
"current_axes" : "coupled",
"custom_axes" : {
"coupled" : {
"a" : [
[ 0 ],
[ 1 ],
[ 2 ],
[ 1 ],
[ 4 ]
],
"b" : [
[ 0.75 ],
[ 0.25 ],
[ -2.75 ],
[ 0.25 ],
[ 4.00 ]
],
"components" : [ "Sb", "W", "Na", "Va", "S" ],
"independent_compositions" : 2,
"origin" : [
[ 1 ],
[ 0 ],
[ 3 ],
[ 0 ],
[ 4 ]
]
}
}
}
The composition along an axis is Composition = Origin + (End - Origin) * x, where End is axis endpoint a or b here. Then compute the composition axes:
casm composition -c
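The relation above can be checked numerically. A small sketch, using the "origin" and "a" columns from the composition_axes.json in this section (the choice of x = 0.25 is just an illustration):

```python
# Composition = Origin + (End - Origin) * x, evaluated per component.
# "origin" and "end_a" are the columns of composition_axes.json above.
components = ["Sb", "W", "Na", "Va", "S"]
origin = [1, 0, 3, 0, 4]
end_a  = [0, 1, 2, 1, 4]

def composition(origin, end, x):
    # comp_i = origin_i + (end_i - origin_i) * x
    return [o + (e - o) * x for o, e in zip(origin, end)]

# x = 0.25 along axis a recovers the Na2.75Sb0.75W0.25S4 composition that
# appears in the chemical-reference commands later in this tutorial.
print(dict(zip(components, composition(origin, end_a, 0.25))))
# → {'Sb': 0.75, 'W': 0.25, 'Na': 2.75, 'Va': 0.25, 'S': 4.0}
```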
- Import calculated DFT results. Use "vasp.relax.report" to generate properties.calc.json in each directory. Generate a file list, "reports_path_primitive.txt", containing all the paths to the POSCAR files. Then import the results into .casm/config_list.json:
If you want to import updated results, make sure you exclude the old paths from "reports_path_primitive.txt"; otherwise there will be duplicates in the database.
casm import --batch report_path.txt --ideal --data --min-energy
How to add new structures to the database
- Generate a file containing all the POSCAR paths, 'reports_path.txt' -> this file should exclude all previously imported structures, otherwise it will introduce double counting!
- casm import --batch reports_path.txt --ideal --data --min-energy
- casm select --set is_calculated -o train
- casm query -k "formation_energy corr" -c train -o casm_learn_input2
You can use the following script to generate report_path.txt:
#!/bin/bash
# Collect relaxed VASP outputs, run vasp.relax.report in each results
# directory, and record the POSCAR paths in report_path.txt.
home_dir=$PWD

# primitive cell
for i in ../*/Na*/
do
    structure_name=$(basename "$i")
    composition=$(basename "$(dirname "$i")")
    echo "$i" "$structure_name" "$composition"
    result_path=primitive/$composition/$structure_name
    mkdir -p "$result_path"
    cp "$i/vasprun.xml.relaxed.gz" "$result_path/vasprun.xml.gz"
    cp "$i/OUTCAR.relaxed.gz"      "$result_path/OUTCAR.gz"
    cp "$i/OSZICAR.relaxed.gz"     "$result_path/OSZICAR.gz"
    cp "$i/POSCAR.orig.gz"         "$result_path/POSCAR.gz"
    cp "$i/CONTCAR.relaxed.gz"     "$result_path/CONTCAR.gz"
    echo "$(realpath "$result_path")/POSCAR" >> report_path.txt
    cd "$result_path"
    vasp.relax.report .
    cd "$home_dir"
done

# 2x2x2 supercell
for i in ../supercell_222/*/Na*/
do
    structure_name=$(basename "$i")
    composition=$(basename "$(dirname "$i")")
    echo "$i" "$structure_name" "$composition"
    result_path=supercell/$composition/$structure_name
    mkdir -p "$result_path"
    cp "$i/vasprun.xml.relaxed.gz" "$result_path/vasprun.xml.gz"
    cp "$i/OUTCAR.relaxed.gz"      "$result_path/OUTCAR.gz"
    cp "$i/OSZICAR.relaxed.gz"     "$result_path/OSZICAR.gz"
    cp "$i/POSCAR.orig.gz"         "$result_path/POSCAR.gz"
    cp "$i/CONTCAR.relaxed.gz"     "$result_path/CONTCAR.gz"
    echo "$(realpath "$result_path")/POSCAR" >> report_path.txt
    cd "$result_path"
    vasp.relax.report .
    cd "$home_dir"
done
- How to fix a known bug in vasp.relax.report:
https://github.com/prisms-center/CASMcode/issues/120
python/casm/casm/scripts/vasp_relax_report.py:
-        compat.dump(json, output, 'properties.calc.json', 'w', cls=noindent.NoIndentEncoder,
-                    indent=4, sort_keys=True)
+        with open('properties.calc.json', 'w') as file:
+            file.write(json.dumps(output, cls=noindent.NoIndentEncoder, indent=4, sort_keys=True))
- Choose chemical reference (per species = per atom)
'[
{"Na": 3, "Sb":1.0, "W": 0.0, "S": 4.0, "energy_per_species": -4.065837375},
{"Na": 22, "Sb": 6.0, "W": 2.0, "S": 32.0, "energy_per_species": -4.2799299},
{"Na":1.0,"energy_per_species":-1.4665042}
]'
Pass a string like the above directly to the command, e.g.:
casm ref --set '[{"Na": 3, "Sb":1.0, "W": 0.0, "S": 4.0, "energy_per_species": -4.065837375},{"Na": 2.75, "Sb": 0.75, "W": 0.25, "S": 4.0, "energy_per_species": -4.2799299},{"Na":1.0,"energy_per_species":-1.4665042}]'
casm ref --set '[{"Na": 2.75, "Sb": 0.75, "W": 0.25, "S": 4.0, "energy_per_species": -4.417992175806452},{"Na": 3, "Sb":1.0, "W": 0.0, "S": 4.0, "energy_per_species": -4.065837375},{"Na":1.0,"energy_per_species":-1.4665042}]'
casm ref --set '[{"Na": 2, "W": 1, "S": 4.0, "energy_per_species": -5.603489787142857},{"Na": 3, "Sb":1.0, "W": 0.0, "S": 4.0, "energy_per_species": -4.065837375},{"Na":1.0,"energy_per_species":-1.4665042}]'
How to change the reference:
- casm ref --set … to set a new reference
- casm query -k "formation_energy corr" -c calculated_config -o casm_learn_input to get a new casm_learn_input
- ./fit.sh to get a new fitting
To add a new column to casm_learn_input (e.g. a constant weight for every row), in vim:
:%s/$/ 1.0/
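The same edit can be scripted; a minimal sketch (the helper name and the constant 1.0 are just illustrative, matching the vim substitution above):

```python
# Script equivalent of the vim substitution :%s/$/ 1.0/ — append a constant
# weight (1.0 here) to every line of the query output.
def add_weight_column(lines, weight=1.0):
    return [line.rstrip("\n") + f" {weight}" for line in lines]

rows = ["-3.41 1.0 0.25 0.0", "-3.52 1.0 0.50 0.1"]
print(add_weight_column(rows))
# → ['-3.41 1.0 0.25 0.0 1.0', '-3.52 1.0 0.50 0.1 1.0']
```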
The chemical reference can be updated later, after you add new structures:
casm update
- Create the basis functions (it is better to use the Chebyshev site basis functions) by editing basis_sets/bset.default/bspecs.json. Occupation can be changed to other degrees of freedom such as spin. "orbit_branch_specs" sets the maximum cluster size (max_length) used when generating basis functions for each cluster order; the cutoff usually decreases as the order increases.
{
"basis_functions" : {
"site_basis_functions" : "chebyshev"
},
"orbit_branch_specs" : {
"2" : {"max_length" : 12.0000},
"3" : {"max_length" : 7.00000},
"4" : {"max_length" : 6.00000}
}
}
Then compile to get the basis functions (this can take ~30 minutes):
casm bset -u
CE Model Fitting
Please read the documentation:
casm-learn --desc
https://scikit-learn.org/stable/modules/cross_validation.html
- The goal is to obtain the minimum number of ECIs (clusters) that can fit our dataset
- E(n_sample, 1) = Weight(n_sample, n_sample) x Corr(n_sample, n_feature) x ECI(n_feature, 1) + constant
- Algorithm: Lasso applies an L1-norm penalty to the parameter magnitudes, driving many ECIs to zero
- alpha in Lasso: larger alpha => fewer features; smaller alpha => more features
- Overfitting and underfitting: cross-validation (CV) score => split your data into a training set and a test set; fit on the training set and measure the error on the test set
- Why we need cross-validation: to ensure the predictability of our model
- How to choose alpha: scan alpha and plot alpha vs. CV score; the minimum is the alpha to choose
- CV score: 1) leave-one-out (LOOCV): N_sample fit-and-test cycles are performed and the errors averaged; 2) K-fold: split the data into K groups, fit and test with each group held out separately, then average the errors
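The LOOCV procedure above can be sketched in pure Python. Everything here is illustrative: synthetic data and a plain two-feature least-squares fit stand in for the real correlation matrix and casm-learn's Lasso estimator:

```python
# Leave-one-out cross-validation for a linear model E = corr . ECI.

def fit_ols(X, y):
    # Solve the 2x2 normal equations (X^T X) eci = X^T y by Cramer's rule.
    a = sum(x[0] * x[0] for x in X)
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X)
    p = sum(x[0] * yi for x, yi in zip(X, y))
    q = sum(x[1] * yi for x, yi in zip(X, y))
    det = a * d - b * b
    return ((d * p - b * q) / det, (a * q - b * p) / det)

def loocv_score(X, y):
    # Fit on n-1 samples, test on the held-out sample, then average -> RMSE.
    n = len(y)
    sq_errs = []
    for k in range(n):
        Xt = [x for i, x in enumerate(X) if i != k]
        yt = [v for i, v in enumerate(y) if i != k]
        eci = fit_ols(Xt, yt)
        pred = sum(c * f for c, f in zip(eci, X[k]))
        sq_errs.append((pred - y[k]) ** 2)
    return (sum(sq_errs) / n) ** 0.5

# Tiny synthetic "correlation matrix": constant term + one pair correlation.
X = [(1.0, 0.0), (1.0, 0.5), (1.0, 1.0), (1.0, 0.25), (1.0, 0.75)]
y = [0.0, -0.20, 0.02, -0.12, -0.11]
print("LOOCV RMSE:", loocv_score(X, y))
```

In casm-learn, this score is what you would plot against alpha to pick the estimator settings.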
- After an initial fitting, you can query data from the database using casm query. Since we have a coupled cluster expansion here, the number of unique point terms is smaller than the total number of point terms, so some point terms should be removed. In this study we have 50 cluster functions (basis_sets/bset.default/basis.json), and the first cluster is the empty cluster.
- Prepare the ECI fitting. Create a folder, e.g. fit_1. Select the candidates for fitting and save them to "calculated_config":
casm select --set is_calculated -o calculated_config
Create the casm-learn input file fit.json using the Lasso algorithm. The candidate list file named in "filename" ("calculated_config" here) must exist in this folder:
{
"estimator": {
"method": "Lasso",
"kwargs": {
"alpha": 0.0001,
"max_iter": 1000000.0
}
},
"feature_selection": {
"method": "SelectFromModel",
"kwargs": null
},
"problem_specs": {
"data": {
"y": "formation_energy",
"X": "corr",
"kwargs": null,
"type": "selection",
"filename": "calculated_config"
},
"cv": {
"penalty": 0.0,
"method": "LeaveOneOut"
}
},
"n_halloffame": 25
}
- Fit ECI
casm-learn -s fit.json
Two files will be generated: "fit_specs.pkl", storing the training data, weights, and cross-validation train/test sets, and "fit_halloffame.pkl", storing the selected candidates. Then adjust fit.json and repeat the fitting until it is satisfactory (use the fewest features that reproduce the most results). For the settings format, see:
casm-learn --settings-format
Generate eci.json and use it for Monte Carlo:
casm-learn -s fit.json --select 0
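Once --select writes eci.json, the CE prediction for any configuration is just a dot product of the nonzero ECIs with that configuration's correlation vector. A sketch with made-up numbers (the real indices and values come from eci.json and basis.json):

```python
# Nonzero ECIs keyed by basis-function index (made-up values).
eci = {0: -3.2, 1: 0.41, 5: -0.07, 12: 0.02}

# Correlation vector of one configuration (made-up values); corr[0] = 1.0
# is the empty cluster, so eci[0] acts as the constant term.
corr = [1.0, -0.5, 0.0, 0.2, 0.1, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4]

energy_per_prim = sum(v * corr[j] for j, v in eci.items())
print(energy_per_prim)   # ≈ -3.4145
```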
- Plot the convex hull. Query energies from the database:
casm query -k 'comp(a)' 'formation_energy' 'clex(formation_energy)' 'hull_dist(ALL,atom_frac)' 'clex_hull_dist(ALL,atom_frac)' -c casm_learn_input -o data.dat
Query the hull from the database:
casm query -k 'comp(a)' 'formation_energy' 'clex(formation_energy)' 'on_hull(ALL,comp)' 'on_clex_hull(ALL,comp)' 'comp_n(Na)' -c casm_learn_input -o hull.dat
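The hull_dist and clex_hull_dist keys above measure each configuration's distance above the lower convex hull of formation energy vs. composition. For one composition axis the hull itself is easy to sketch (data below are made up):

```python
def lower_hull(points):
    # Lower half of Andrew's monotone chain; points are (comp, E_f) pairs.
    pts = sorted(points)
    hull = []
    for p in pts:
        # Pop the last vertex while it lies on or above the segment
        # from hull[-2] to the new point p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

data = [(0.0, 0.0), (0.25, 0.1), (0.5, -0.2), (1.0, 0.0)]
print(lower_hull(data))   # → [(0.0, 0.0), (0.5, -0.2), (1.0, 0.0)]
```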
- You may see that the cluster expansion (clex) convex hull is far from the DFT convex hull. To fix this, first fix the correlations (cluster expansion coefficients; see "useful emails", point term) and fit the weights. Then use:
#!/bin/bash
# Re-fit with the given settings file, regenerate the ECIs, and replot the hull.
filename=$1
./clean.sh
rm "${filename%.*}"_*
casm-learn -s "$filename"
casm-learn -s "$filename" --checkhull
casm-learn -s "$filename" --select 0
casm-learn -s "$filename" --hall --indiv 0 --format json > "${filename%.*}-eci.json"
#casm query -k 'comp(a)' 'formation_energy' 'clex(formation_energy)' 'hull_dist(ALL,atom_frac)' 'clex_hull_dist(ALL,atom_frac)' -c ALL -o data.dat
#casm query -k 'comp(a)' 'formation_energy' 'clex(formation_energy)' 'on_hull(ALL,comp)' 'on_clex_hull(ALL,comp)' 'comp_n(Na)' -c ALL -o hull.dat
casm query -k 'comp(a)' 'formation_energy' 'clex(formation_energy)' 'hull_dist(ALL,atom_frac)' 'clex_hull_dist(ALL,atom_frac)' -c train -o data.dat
casm query -k 'comp(a)' 'formation_energy' 'clex(formation_energy)' 'on_hull(ALL,comp)' 'on_clex_hull(ALL,comp)' 'comp_n(Na)' -c train -o hull.dat
python plot_convex_refactor.py
mv Convex_hull.pdf "${filename%.*}.pdf"
mv hull.dat "${filename%.*}_hull.dat"
mv data.dat "${filename%.*}_fit.dat"
echo "${filename%.*}"
open "${filename%.*}.pdf"
- To do the fitting: tune the weights in the train file until the CV error becomes small. In addition, the ECIs should follow the general trend: pair terms dominant, then triplets, quadruplets, etc. First, query the formation energies and correlations:
casm query -k "formation_energy corr" -c calculated_config -o casm_learn_input
Next, add a column called "weight" and set the per-configuration weights (see the point-term notes above). Then use the following fit.json, which enables the custom weights via "wCustom", to do the fitting:
{
"estimator": {
"method": "Lasso",
"kwargs": {
"alpha": 0.0001,
"max_iter": 1000000.0
}
},
"feature_selection": {
"method": "SelectFromModel",
"kwargs": null
},
"problem_specs": {
"data": {
"y": "formation_energy",
"X": "corr",
"kwargs": null,
"type": "selection",
"filename": "train"
},
"cv": {
"penalty": 0.0,
"method": "LeaveOneOut"
},
"weight":{
"method":"wCustom"
}
},
"n_halloffame": 25
}
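A sketch of what per-sample weights do to the least-squares problem underlying "wCustom". One hypothetical feature and made-up data; weighting row i is equivalent to scaling that row of X and y by sqrt(w_i):

```python
# One-feature weighted least squares: minimize sum_i w_i * (y_i - x_i*eci)^2.
# Closed form: eci = sum(w x y) / sum(w x^2).
def weighted_fit(x, y, w):
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den

x = [1.0, 2.0, 3.0]
y = [1.1, 1.9, 3.2]
print("uniform weights:   ", weighted_fit(x, y, [1, 1, 1]))
print("emphasize sample 3:", weighted_fit(x, y, [1, 1, 10]))
```

Increasing a sample's weight pulls the fitted coefficient toward reproducing that sample, which is how up-weighting hull configurations pulls the clex hull toward the DFT hull.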
Some Solutions for fitting CE
- Add more data after x = 0.25
- Try to use a smaller basis set
- Calculate other phases off the tie line (ternary phase diagram)
Monte Carlo
- Why grand canonical Monte Carlo (GCMC, the μVT ensemble)? Because it is difficult to identify whether a given structure is a single phase or a phase-separated mixture; i.e., in the canonical ensemble (NVT) you might get phase separation, whereas in the grand canonical ensemble (μVT) you are always in a single-phase region.
- The free energy in the grand canonical ensemble becomes the grand canonical free energy Φ = E − TS − μN, where μ is the chemical potential and N depends on μ. E can be predicted by the CE model once an occupation vector is supplied. However, Φ cannot be computed directly: what you can get from a GCMC simulation are E, N, the occupation, and quantities that can be derived from these. Φ should therefore be computed by thermodynamic integration (see SI page 10 in https://pubs.acs.org/doi/10.1021/acs.chemmater.0c02695). The phase regions can then be visualized by taking, at each condition, the phase whose Φ is a minimum.
- The phase diagram should also obey the Gibbs phase rule (https://en.wikipedia.org/wiki/Phase_rule)
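The sampling itself can be sketched with a toy model. Everything below is hypothetical (a 1D ring, made-up point and pair ECIs, made-up μ and T); it only illustrates the Metropolis acceptance rule in the grand canonical ensemble, not CASM's actual Monte Carlo code:

```python
import math
import random

# Sites hold Na (1) or a vacancy (0); the "CE" is just a point term plus a
# nearest-neighbor pair ECI on a 1D ring, to keep the example tiny.
ECI_POINT, ECI_PAIR = -0.1, 0.05   # eV, made-up values
KB = 8.617e-5                      # Boltzmann constant, eV/K

def energy(occ):
    # CE energy: point term + nearest-neighbor pair term on a ring.
    n = len(occ)
    return (ECI_POINT * sum(occ)
            + ECI_PAIR * sum(occ[i] * occ[(i + 1) % n] for i in range(n)))

def gcmc(occ, mu, temp, steps, rng):
    # Metropolis sampling in the grand canonical ensemble: accept a site
    # flip with probability min(1, exp(-(dE - mu*dN) / kT)).
    beta = 1.0 / (KB * temp)
    for _ in range(steps):
        i = rng.randrange(len(occ))
        trial = occ[:]
        trial[i] = 1 - trial[i]
        d_e = energy(trial) - energy(occ)
        d_n = trial[i] - occ[i]
        if rng.random() < math.exp(min(0.0, -beta * (d_e - mu * d_n))):
            occ = trial
    return occ

occ = gcmc([0] * 20, mu=-0.05, temp=600.0, steps=2000, rng=random.Random(0))
print("<N>/site:", sum(occ) / len(occ))
```

Averaging E and N over such sweeps at a grid of (T, μ) points provides the inputs for the thermodynamic integration described above.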