CASM Tutorial
- Cluster expansion (CE) model: https://pubs.acs.org/doi/10.1021/acs.chemmater.0c02695
- A CE model is a lattice Hamiltonian used to compute the formation energy of any configuration (occupation) at a much smaller computational cost than DFT
- Goal: fit a CE model from DFT data to obtain thermodynamic knowledge of Na3-xSb1-xWxS4
- Quantities to fit: effective cluster interactions (ECIs), the coefficients of the different clusters
- Code used for fitting: CASM, developed by Prof. Anton van der Ven (UCSB)
Installation of CASM 0.3.x on Orion
- Our experience shows that CASM 0.3.x works best, although a newer version (1.0) is available
- This is the easiest way to install CASM from source: first install the official CASM package with Anaconda from their repository, which pulls in all the dependencies (these are very hard to install manually); then uninstall the repository CASM and reinstall our own version of CASM locally.
Clone and install CASM 0.3.x
git clone https://github.com/prisms-center/CASMcode.git
cd CASMcode
git checkout be1c21b
cd ..
Create conda environment and install all packages
conda create -n casm_0.3 --override-channels -c bpuchala/label/dev -c prisms-center -c defaults -c conda-forge casm=0.3.dev269+gd07b42=condagcc_0 casm-boost=1.66.0=condagcc_0 casm-cpp=0.3.dev269+gd07b42=condagcc_0 casm-python=0.3.dev269+gd07b42=0 scikit-learn=0.21.2=py36hd81dba3_0 bokeh=1.2.0=py36_0 python=3.6.8=h0371630_0
Enable CASM environment
conda activate casm_0.3
Remove the installed CASM
conda remove casm casm-python casm-cpp --force
Compile and install CASM from source
- Make sure your environment has GCC installed; gcc 7.4.0 is known to work.
cd CASMcode
bash build_install.sh
CE Model Construction
- Create a prim.json file. This file contains structural information about the primitive cell (we usually use the experimental cell). Then initialize the project:
casm init
Data structure:
basis:
  coordinate -> coordinate of each site
  occupant_dof -> e.g. ["Na","Va"]; "Va" stands for vacancy
  …
coordinate_mode -> "Cartesian" or "Fractional"
description
lattice_vectors
title
Example prim.json (here from a NASICON project):
{
"basis" : [
{
"coordinate" : [ 0.500000, 0.500000, 0.500000],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.000000, 0.000000, 0.000000],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.889670, 0.610330, 0.250000],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.610330, 0.250000, 0.889670],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.250000, 0.889670, 0.610330],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.389670, 0.750000, 0.110330],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.750000, 0.110330, 0.389670],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.110330, 0.389670, 0.750000],
"occupant_dof" : ["Na","Va"]
},
{
"coordinate" : [ 0.352810, 0.352810, 0.352810],
"occupant_dof" : ["Zr"]
}
],
"coordinate_mode" : "Fractional",
"description" : "Si-based NASICON ",
"lattice_vectors" : [
[4.593150 ,2.651856 , 7.393667],
[-4.593150, 2.651856 , 7.393667],
[-0.000000, -5.303713, 7.393667]
],
"title" : "NASICON_prim"
}
- Create composition axes. This defines the composition variables used for the phase diagram; two coupled axes are used for 2D cases (see "useful emails" for why the two coupled axes are needed). Edit .casm/composition_axes.json:
{
"current_axes" : "coupled",
"custom_axes" : {
"coupled" : {
"a" : [
[ 0 ],
[ 1 ],
[ 2 ],
[ 1 ],
[ 4 ]
],
"b" : [
[ 0.75 ],
[ 0.25 ],
[ -2.75 ],
[ 0.25 ],
[ 4.00 ]
],
"components" : [ "Sb", "W", "Na", "Va", "S" ],
"independent_compositions" : 2,
"origin" : [
[ 1 ],
[ 0 ],
[ 3 ],
[ 0 ],
[ 4 ]
]
}
}
}
The composition along an axis is Composition = Origin + (End - Origin) * x, where End is axis endpoint a or b here. Then compute the composition axes:
casm composition -c
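The relation above can be checked numerically. A small sketch, using the "origin" and "a" columns from the composition_axes.json in this section (the choice of x = 0.25 is just an illustration):

```python
# Composition = Origin + (End - Origin) * x, evaluated per component.
# "origin" and "end_a" are the columns of composition_axes.json above.
components = ["Sb", "W", "Na", "Va", "S"]
origin = [1, 0, 3, 0, 4]
end_a  = [0, 1, 2, 1, 4]

def composition(origin, end, x):
    # comp_i = origin_i + (end_i - origin_i) * x
    return [o + (e - o) * x for o, e in zip(origin, end)]

# x = 0.25 along axis a recovers the Na2.75Sb0.75W0.25S4 composition that
# appears in the chemical-reference commands later in this tutorial.
print(dict(zip(components, composition(origin, end_a, 0.25))))
# → {'Sb': 0.75, 'W': 0.25, 'Na': 2.75, 'Va': 0.25, 'S': 4.0}
```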
- Import calculated DFT results. Use "vasp.relax.report" to generate properties.calc.json in each directory. Generate a file list, "reports_path_primitive.txt", containing all the paths to the POSCAR files. Then import the results into .casm/config_list.json:
If you want to import updated results, make sure you exclude the old paths from "reports_path_primitive.txt"; otherwise there will be duplicates in the database.
casm import --batch report_path.txt --ideal --data --min-energy
How to add new structures to the database
- Generate a file containing all the POSCAR paths, 'reports_path.txt' -> this file should exclude all previously imported structures, otherwise it will introduce double counting!
- casm import --batch reports_path.txt --ideal --data --min-energy
- casm select --set is_calculated -o train
- casm query -k "formation_energy corr" -c train -o casm_learn_input2
You can use the following script to generate report_path.txt:
#!/bin/bash
# Collect relaxed VASP outputs, run vasp.relax.report in each results
# directory, and record the POSCAR paths in report_path.txt.
home_dir=$PWD

# primitive cell
for i in ../*/Na*/
do
    structure_name=$(basename "$i")
    composition=$(basename "$(dirname "$i")")
    echo "$i" "$structure_name" "$composition"
    result_path=primitive/$composition/$structure_name
    mkdir -p "$result_path"
    cp "$i/vasprun.xml.relaxed.gz" "$result_path/vasprun.xml.gz"
    cp "$i/OUTCAR.relaxed.gz"      "$result_path/OUTCAR.gz"
    cp "$i/OSZICAR.relaxed.gz"     "$result_path/OSZICAR.gz"
    cp "$i/POSCAR.orig.gz"         "$result_path/POSCAR.gz"
    cp "$i/CONTCAR.relaxed.gz"     "$result_path/CONTCAR.gz"
    echo "$(realpath "$result_path")/POSCAR" >> report_path.txt
    cd "$result_path"
    vasp.relax.report .
    cd "$home_dir"
done

# 2x2x2 supercell
for i in ../supercell_222/*/Na*/
do
    structure_name=$(basename "$i")
    composition=$(basename "$(dirname "$i")")
    echo "$i" "$structure_name" "$composition"
    result_path=supercell/$composition/$structure_name
    mkdir -p "$result_path"
    cp "$i/vasprun.xml.relaxed.gz" "$result_path/vasprun.xml.gz"
    cp "$i/OUTCAR.relaxed.gz"      "$result_path/OUTCAR.gz"
    cp "$i/OSZICAR.relaxed.gz"     "$result_path/OSZICAR.gz"
    cp "$i/POSCAR.orig.gz"         "$result_path/POSCAR.gz"
    cp "$i/CONTCAR.relaxed.gz"     "$result_path/CONTCAR.gz"
    echo "$(realpath "$result_path")/POSCAR" >> report_path.txt
    cd "$result_path"
    vasp.relax.report .
    cd "$home_dir"
done
- How to fix a known bug in vasp.relax.report:
https://github.com/prisms-center/CASMcode/issues/120
python/casm/casm/scripts/vasp_relax_report.py:
-        compat.dump(json, output, 'properties.calc.json', 'w', cls=noindent.NoIndentEncoder,
-                    indent=4, sort_keys=True)
+        with open('properties.calc.json', 'w') as file:
+            file.write(json.dumps(output, cls=noindent.NoIndentEncoder, indent=4, sort_keys=True))
- Choose chemical reference (per species = per atom)
'[
{"Na": 3, "Sb":1.0, "W": 0.0, "S": 4.0, "energy_per_species": -4.065837375},
{"Na": 22, "Sb": 6.0, "W": 2.0, "S": 32.0, "energy_per_species": -4.2799299},
{"Na":1.0,"energy_per_species":-1.4665042}
]'
Pass a string like the above directly to the command, e.g.:
casm ref --set '[{"Na": 3, "Sb":1.0, "W": 0.0, "S": 4.0, "energy_per_species": -4.065837375},{"Na": 2.75, "Sb": 0.75, "W": 0.25, "S": 4.0, "energy_per_species": -4.2799299},{"Na":1.0,"energy_per_species":-1.4665042}]'
casm ref --set '[{"Na": 2.75, "Sb": 0.75, "W": 0.25, "S": 4.0, "energy_per_species": -4.417992175806452},{"Na": 3, "Sb":1.0, "W": 0.0, "S": 4.0, "energy_per_species": -4.065837375},{"Na":1.0,"energy_per_species":-1.4665042}]'
casm ref --set '[{"Na": 2, "W": 1, "S": 4.0, "energy_per_species": -5.603489787142857},{"Na": 3, "Sb":1.0, "W": 0.0, "S": 4.0, "energy_per_species": -4.065837375},{"Na":1.0,"energy_per_species":-1.4665042}]'
How to change the reference:
- casm ref --set … to set a new reference
- casm query -k "formation_energy corr" -c calculated_config -o casm_learn_input to get a new casm_learn_input
- ./fit.sh to get a new fitting
To add a new column to casm_learn_input (e.g. a constant weight for every row), in vim:
:%s/$/ 1.0/
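The same edit can be scripted; a minimal sketch (the helper name and the constant 1.0 are just illustrative, matching the vim substitution above):

```python
# Script equivalent of the vim substitution :%s/$/ 1.0/ — append a constant
# weight (1.0 here) to every line of the query output.
def add_weight_column(lines, weight=1.0):
    return [line.rstrip("\n") + f" {weight}" for line in lines]

rows = ["-3.41 1.0 0.25 0.0", "-3.52 1.0 0.50 0.1"]
print(add_weight_column(rows))
# → ['-3.41 1.0 0.25 0.0 1.0', '-3.52 1.0 0.50 0.1 1.0']
```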
The chemical reference can be updated later, after you add new structures:
casm update
- Create the basis functions (it is better to use the Chebyshev site basis functions) by editing basis_sets/bset.default/bspecs.json. Occupation can be changed to other degrees of freedom such as spin. "orbit_branch_specs" sets the maximum cluster size (max_length) used when generating basis functions for each cluster order; the cutoff usually decreases as the order increases.
{
"basis_functions" : {
"site_basis_functions" : "chebyshev"
},
"orbit_branch_specs" : {
"2" : {"max_length" : 12.0000},
"3" : {"max_length" : 7.00000},
"4" : {"max_length" : 6.00000}
}
}
Then compile to get the basis functions (this can take ~30 minutes):
casm bset -u
CE Model Fitting
Please read the documentation:
casm-learn --desc
https://scikit-learn.org/stable/modules/cross_validation.html
- The goal is to obtain the minimum number of ECIs (clusters) that can fit our dataset
- E(n_sample, 1) = Weight(n_sample, n_sample) x Corr(n_sample, n_feature) x ECI(n_feature, 1) + constant
- Algorithm: Lasso applies an L1-norm penalty to the parameter magnitudes, driving many ECIs to zero
- alpha in Lasso: larger alpha => fewer features; smaller alpha => more features
- Overfitting and underfitting: cross-validation (CV) score => split your data into a training set and a test set; fit on the training set and measure the error on the test set
- Why we need cross-validation: to ensure the predictability of our model
- How to choose alpha: scan alpha and plot alpha vs. CV score; the minimum is the alpha to choose
- CV score: 1) leave-one-out (LOOCV): N_sample fit-and-test cycles are performed and the errors averaged; 2) K-fold: split the data into K groups, fit and test with each group held out separately, then average the errors
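The LOOCV procedure above can be sketched in pure Python. Everything here is illustrative: synthetic data and a plain two-feature least-squares fit stand in for the real correlation matrix and casm-learn's Lasso estimator:

```python
# Leave-one-out cross-validation for a linear model E = corr . ECI.

def fit_ols(X, y):
    # Solve the 2x2 normal equations (X^T X) eci = X^T y by Cramer's rule.
    a = sum(x[0] * x[0] for x in X)
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X)
    p = sum(x[0] * yi for x, yi in zip(X, y))
    q = sum(x[1] * yi for x, yi in zip(X, y))
    det = a * d - b * b
    return ((d * p - b * q) / det, (a * q - b * p) / det)

def loocv_score(X, y):
    # Fit on n-1 samples, test on the held-out sample, then average -> RMSE.
    n = len(y)
    sq_errs = []
    for k in range(n):
        Xt = [x for i, x in enumerate(X) if i != k]
        yt = [v for i, v in enumerate(y) if i != k]
        eci = fit_ols(Xt, yt)
        pred = sum(c * f for c, f in zip(eci, X[k]))
        sq_errs.append((pred - y[k]) ** 2)
    return (sum(sq_errs) / n) ** 0.5

# Tiny synthetic "correlation matrix": constant term + one pair correlation.
X = [(1.0, 0.0), (1.0, 0.5), (1.0, 1.0), (1.0, 0.25), (1.0, 0.75)]
y = [0.0, -0.20, 0.02, -0.12, -0.11]
print("LOOCV RMSE:", loocv_score(X, y))
```

In casm-learn, this score is what you would plot against alpha to pick the estimator settings.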
- After an initial fitting, you can query data from the database using casm query. Since we have a coupled cluster expansion here, the number of unique point terms is smaller than the total number of point terms, so some point terms should be removed. In this study we have 50 cluster functions (basis_sets/bset.default/basis.json), and the first cluster is the empty cluster.
- Prepare the ECI fitting. Create a folder, e.g. fit_1. Select the candidates for fitting and save them to "calculated_config":
casm select --set is_calculated -o calculated_config
Create the casm-learn input file fit.json using the Lasso algorithm. The candidate list file named in "filename" ("calculated_config" here) must exist in this folder:
{
"estimator": {
"method": "Lasso",
"kwargs": {
"alpha": 0.0001,
"max_iter": 1000000.0
}
},
"feature_selection": {
"method": "SelectFromModel",
"kwargs": null
},
"problem_specs": {
"data": {
"y": "formation_energy",
"X": "corr",
"kwargs": null,
"type": "selection",
"filename": "calculated_config"
},
"cv": {
"penalty": 0.0,
"method": "LeaveOneOut"
}
},
"n_halloffame": 25
}
- Fit ECI
casm-learn -s fit.json
Two files will be generated: "fit_specs.pkl", storing the training data, weights, and cross-validation train/test sets, and "fit_halloffame.pkl", storing the selected candidates. Then adjust fit.json and repeat the fitting until it is satisfactory (use the fewest features that reproduce the most results). For the settings format, see:
casm-learn --settings-format
Generate eci.json and use it for Monte Carlo:
casm-learn -s fit.json --select 0
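Once --select writes eci.json, the CE prediction for any configuration is just a dot product of the nonzero ECIs with that configuration's correlation vector. A sketch with made-up numbers (the real indices and values come from eci.json and basis.json):

```python
# Nonzero ECIs keyed by basis-function index (made-up values).
eci = {0: -3.2, 1: 0.41, 5: -0.07, 12: 0.02}

# Correlation vector of one configuration (made-up values); corr[0] = 1.0
# is the empty cluster, so eci[0] acts as the constant term.
corr = [1.0, -0.5, 0.0, 0.2, 0.1, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4]

energy_per_prim = sum(v * corr[j] for j, v in eci.items())
print(energy_per_prim)   # ≈ -3.4145
```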
- Plot the convex hull. Query energies from the database:
casm query -k 'comp(a)' 'formation_energy' 'clex(formation_energy)' 'hull_dist(ALL,atom_frac)' 'clex_hull_dist(ALL,atom_frac)' -c casm_learn_input -o data.dat
Query the hull from the database:
casm query -k 'comp(a)' 'formation_energy' 'clex(formation_energy)' 'on_hull(ALL,comp)' 'on_clex_hull(ALL,comp)' 'comp_n(Na)' -c casm_learn_input -o hull.dat
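The hull_dist and clex_hull_dist keys above measure each configuration's distance above the lower convex hull of formation energy vs. composition. For one composition axis the hull itself is easy to sketch (data below are made up):

```python
def lower_hull(points):
    # Lower half of Andrew's monotone chain; points are (comp, E_f) pairs.
    pts = sorted(points)
    hull = []
    for p in pts:
        # Pop the last vertex while it lies on or above the segment
        # from hull[-2] to the new point p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

data = [(0.0, 0.0), (0.25, 0.1), (0.5, -0.2), (1.0, 0.0)]
print(lower_hull(data))   # → [(0.0, 0.0), (0.5, -0.2), (1.0, 0.0)]
```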
- You may see that the cluster expansion (clex) convex hull is far from the DFT convex hull. To fix this, first fix the correlations (cluster expansion coefficients; see "useful emails", point term) and fit the weights. Then use:
#!/bin/bash
# Re-fit with the given settings file, regenerate the ECIs, and replot the hull.
filename=$1
./clean.sh
rm "${filename%.*}"_*
casm-learn -s "$filename"
casm-learn -s "$filename" --checkhull
casm-learn -s "$filename" --select 0
casm-learn -s "$filename" --hall --indiv 0 --format json > "${filename%.*}-eci.json"
#casm query -k 'comp(a)' 'formation_energy' 'clex(formation_energy)' 'hull_dist(ALL,atom_frac)' 'clex_hull_dist(ALL,atom_frac)' -c ALL -o data.dat
#casm query -k 'comp(a)' 'formation_energy' 'clex(formation_energy)' 'on_hull(ALL,comp)' 'on_clex_hull(ALL,comp)' 'comp_n(Na)' -c ALL -o hull.dat
casm query -k 'comp(a)' 'formation_energy' 'clex(formation_energy)' 'hull_dist(ALL,atom_frac)' 'clex_hull_dist(ALL,atom_frac)' -c train -o data.dat
casm query -k 'comp(a)' 'formation_energy' 'clex(formation_energy)' 'on_hull(ALL,comp)' 'on_clex_hull(ALL,comp)' 'comp_n(Na)' -c train -o hull.dat
python plot_convex_refactor.py
mv Convex_hull.pdf "${filename%.*}.pdf"
mv hull.dat "${filename%.*}_hull.dat"
mv data.dat "${filename%.*}_fit.dat"
echo "${filename%.*}"
open "${filename%.*}.pdf"
- To do the fitting: tune the weights in the train file until the CV error becomes small. In addition, the ECIs should follow the general trend: pair terms dominant, then triplets, quadruplets, etc. First, query the formation energies and correlations:
casm query -k "formation_energy corr" -c calculated_config -o casm_learn_input
Next, add a column called "weight" and set the per-configuration weights (see the point-term notes above). Then use the following fit.json, which enables the custom weights via "wCustom", to do the fitting:
{
"estimator": {
"method": "Lasso",
"kwargs": {
"alpha": 0.0001,
"max_iter": 1000000.0
}
},
"feature_selection": {
"method": "SelectFromModel",
"kwargs": null
},
"problem_specs": {
"data": {
"y": "formation_energy",
"X": "corr",
"kwargs": null,
"type": "selection",
"filename": "train"
},
"cv": {
"penalty": 0.0,
"method": "LeaveOneOut"
},
"weight":{
"method":"wCustom"
}
},
"n_halloffame": 25
}
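A sketch of what per-sample weights do to the least-squares problem underlying "wCustom". One hypothetical feature and made-up data; weighting row i is equivalent to scaling that row of X and y by sqrt(w_i):

```python
# One-feature weighted least squares: minimize sum_i w_i * (y_i - x_i*eci)^2.
# Closed form: eci = sum(w x y) / sum(w x^2).
def weighted_fit(x, y, w):
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den

x = [1.0, 2.0, 3.0]
y = [1.1, 1.9, 3.2]
print("uniform weights:   ", weighted_fit(x, y, [1, 1, 1]))
print("emphasize sample 3:", weighted_fit(x, y, [1, 1, 10]))
```

Increasing a sample's weight pulls the fitted coefficient toward reproducing that sample, which is how up-weighting hull configurations pulls the clex hull toward the DFT hull.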
Some Solutions for fitting CE
- Add more data after x = 0.25
- Try to use a smaller basis set
- Calculate other phases off the tie line (ternary phase diagram)
Monte Carlo
- Why grand canonical Monte Carlo (GCMC, the μVT ensemble)? Because it is difficult to identify whether a given structure is a single phase or a phase-separated mixture; i.e., in the canonical ensemble (NVT) you might get phase separation, whereas in the grand canonical ensemble (μVT) you are always in a single-phase region.
- The free energy in the grand canonical ensemble becomes the grand canonical free energy Φ = E − TS − μN, where μ is the chemical potential and N depends on μ. E can be predicted by the CE model once an occupation vector is supplied. However, Φ cannot be computed directly: what you can get from a GCMC simulation are E, N, the occupation, and quantities that can be derived from these. Φ should therefore be computed by thermodynamic integration (see SI page 10 in https://pubs.acs.org/doi/10.1021/acs.chemmater.0c02695). The phase regions can then be visualized by taking, at each condition, the phase whose Φ is a minimum.
- The phase diagram should also obey the Gibbs phase rule (https://en.wikipedia.org/wiki/Phase_rule)
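The sampling itself can be sketched with a toy model. Everything below is hypothetical (a 1D ring, made-up point and pair ECIs, made-up μ and T); it only illustrates the Metropolis acceptance rule in the grand canonical ensemble, not CASM's actual Monte Carlo code:

```python
import math
import random

# Sites hold Na (1) or a vacancy (0); the "CE" is just a point term plus a
# nearest-neighbor pair ECI on a 1D ring, to keep the example tiny.
ECI_POINT, ECI_PAIR = -0.1, 0.05   # eV, made-up values
KB = 8.617e-5                      # Boltzmann constant, eV/K

def energy(occ):
    # CE energy: point term + nearest-neighbor pair term on a ring.
    n = len(occ)
    return (ECI_POINT * sum(occ)
            + ECI_PAIR * sum(occ[i] * occ[(i + 1) % n] for i in range(n)))

def gcmc(occ, mu, temp, steps, rng):
    # Metropolis sampling in the grand canonical ensemble: accept a site
    # flip with probability min(1, exp(-(dE - mu*dN) / kT)).
    beta = 1.0 / (KB * temp)
    for _ in range(steps):
        i = rng.randrange(len(occ))
        trial = occ[:]
        trial[i] = 1 - trial[i]
        d_e = energy(trial) - energy(occ)
        d_n = trial[i] - occ[i]
        if rng.random() < math.exp(min(0.0, -beta * (d_e - mu * d_n))):
            occ = trial
    return occ

occ = gcmc([0] * 20, mu=-0.05, temp=600.0, steps=2000, rng=random.Random(0))
print("<N>/site:", sum(occ) / len(occ))
```

Averaging E and N over such sweeps at a grid of (T, μ) points provides the inputs for the thermodynamic integration described above.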