Alphafold

Alphafold predicts the structure of a protein from its sequence alone. It uses a machine learning approach that takes models of similar sequences into consideration, but relies mainly on how likely any two residues are to be close to each other in 3D space.

SBGrid packages Alphafold for us (see here)

Running alphafold

If you’ve set up your environment correctly (see Configuring your environment), alphafold-predict should be in your path and will work if you are in a biokem-interactive session.

  1. Log on to OpenOnDemand (see Logging on)

  2. Start an interactive session:

    biokem-interactive
    
  3. Navigate to your working directory (usually /pl/active/<yourlab>)

  4. Create your input file (see Input file)

  5. Run:

    alphafold-predict <your_fasta.fa>
    

Check the status of your job:

squeue -u $USER

Check the output of alphafold (it writes to the error file). If it was the most recent job you submitted, you can omit the jobid argument:

slurm-err <jobid>
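
If you would rather wait for a job from a script than re-run squeue by hand, the following is a minimal sketch. The function names (err_file_for_job, wait_for_job) are hypothetical and not part of the BioKEM tools; the log path follows the --error pattern in the submission script shown further down this page.

```shell
#!/bin/bash
# Hypothetical helpers, not part of alphafold-predict or slurm-err.

# Path of the stderr log for a job, following the
# "--error=/home/%u/slurmfiles_err/slurm_%j.err" pattern of the sbatch script.
err_file_for_job() {
    local jobid="$1"
    echo "/home/${USER:-$(id -un)}/slurmfiles_err/slurm_${jobid}.err"
}

# Block until the job no longer appears in the queue.
# squeue -h suppresses the header, so empty output means the job is gone.
wait_for_job() {
    local jobid="$1" interval="${2:-60}"
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Example usage (replace 12345 with your own job id):
#   tail -f "$(err_file_for_job 12345)"   # follow the alphafold log
#   wait_for_job 12345 && echo "job finished"
```
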

Input file

The only input you need to run Alphafold is your sequence(s) of interest in a file in fasta format.

  1. Navigate to a directory in your PL and make a new directory for each fold you are attempting (mkdir <dir_name>)

  2. Create a fasta file (touch <my_protein>.fa)

  3. Edit the fasta file using nano, vim, or another text editor.

  • You need to add your fasta header, which must contain a > followed by your protein name

  • You will then add your protein sequence in single-letter amino acid code

  • If you are folding a multimer, add multiple entries to this file (> + name, sequence, > + name, sequence, etc.)

example fasta:

>T1083
GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRY
KEAAEENRALAKLHHELAIVED

example multimer fasta:

>T1083
GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRY
KEAAEENRALAKLHHELAIVED
>T1083
GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRY
KEAAEENRALAKLHHELAIVED
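
The steps above can also be scripted. The sketch below writes the monomer example from this page with a heredoc and runs a basic sanity check on it. check_fasta is a hypothetical helper, not something alphafold-predict provides, and it only checks that the first line is a header and that sequence lines contain letters only.

```shell
#!/bin/bash
# Hypothetical helper: basic sanity check of a fasta file.
# Prints the number of chains (header lines) on success.
check_fasta() {
    local fasta="$1"
    # The first line must be a header starting with ">".
    if ! head -n 1 "$fasta" | grep -q '^>'; then
        echo "error: first line of $fasta is not a > header" >&2
        return 1
    fi
    # Sequence lines may only contain letters (single-letter amino acid codes).
    if grep -v '^>' "$fasta" | grep -q '[^A-Za-z]'; then
        echo "error: sequence lines in $fasta contain non-letter characters" >&2
        return 1
    fi
    grep -c '^>' "$fasta"
}

# A temporary directory is used here; in practice use your own fold directory.
workdir=$(mktemp -d)
cat > "$workdir/T1083.fa" <<'EOF'
>T1083
GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRY
KEAAEENRALAKLHHELAIVED
EOF

check_fasta "$workdir/T1083.fa"
```

For a multimer file the printed count should match the number of chains you intended to fold.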

Database

The Alphafold database is over 2 TB in size and takes a prohibitively long time to download on RC infrastructure. I have downloaded it and placed it in /pl/active/BioKEM/software/alphafolddb (you should be able to read this location, but won’t be able to update it). Don’t attempt to download this database on your own; use this copy instead (the script does this for you).
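
If a job fails with a missing database error, it can help to confirm that you can actually read the shared copy. This is a minimal sketch: check_db is a hypothetical helper, and the subdirectory names are taken from the paths used in the submission script on this page, so they are an assumption about the layout rather than a complete listing.

```shell
#!/bin/bash
# Shared database location from this page.
DB="/pl/active/BioKEM/software/alphafolddb"

# Hypothetical helper: report any expected subdirectory that is not readable.
# Returns 0 if everything is readable, 1 otherwise.
check_db() {
    local db="${1:-$DB}" sub missing=0
    for sub in bfd uniclust30 uniref90 mgnify pdb70 pdb_mmcif; do
        if [ ! -r "$db/$sub" ]; then
            echo "not readable: $db/$sub" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage:
#   check_db && echo "database looks readable"
```
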

Submission script

This is the sbatch script that is actually being submitted for you: /projects/biokem/software/biokem/users/example_sbatch_scripts/alphafold/predict_monomer.q (There are a few variations on this script in that folder for multimers and large proteins; alphafold-predict will submit those for you).

#!/bin/bash
#SBATCH --partition=blanca
#SBATCH --qos=preemptable
#SBATCH --account=blanca-biokem
#SBATCH --job-name=alphafold_predict
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --mem=64gb
#SBATCH --constraint=A100|A40
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --output=/home/%u/slurmfiles_out/slurm_%j.out
#SBATCH --error=/home/%u/slurmfiles_err/slurm_%j.err

#Path to fasta file, needs each monomer as own chain
FASTA=$1
echo "Predicting monomer for file: ${FASTA}"

#Run this inside SBGrid environment
export BIOKEM_ALPHA_CPUS=16
TMPDIR=$SLURM_SCRATCH
PATH=$PATH:/curc/sw/cuda/11.2/bin
_PTXAS=userpath
source /programs/sbgrid.shrc

#set to Alphafold 2.3.2 (database needs to be updated if changed)
ALPHAFOLD_X=2.3.2
DB='/pl/active/BioKEM/software/alphafolddb/'

/programs/x86_64-linux/alphafold/${ALPHAFOLD_X}/bin.capsules/run_alphafold.py \
    --data_dir=${DB} \
    --output_dir=$(pwd) \
    --fasta_paths=${FASTA} \
    --max_template_date=2020-05-14 \
    --db_preset=full_dbs \
    --bfd_database_path=${DB}bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniref30_database_path=${DB}uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --uniref90_database_path=${DB}uniref90/uniref90.fasta \
    --mgnify_database_path=${DB}mgnify/mgy_clusters_2018_12.fa \
    --template_mmcif_dir=${DB}pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=${DB}pdb_mmcif/obsolete.dat \
    --use_gpu_relax=True \
    --model_preset=monomer \
    --pdb70_database_path=${DB}pdb70/pdb70
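
alphafold-predict submits this script for you, so submitting it by hand is rarely needed. If you do want to, the sketch below shows one way, based on the script reading the fasta path from its first argument ($1). submit_monomer and DRY_RUN are hypothetical names introduced here, and this is not necessarily what alphafold-predict does internally. Slurm does not create missing log directories, so the sketch creates the ones named in the --output and --error lines first.

```shell
#!/bin/bash
SCRIPT=/projects/biokem/software/biokem/users/example_sbatch_scripts/alphafold/predict_monomer.q

# Hypothetical wrapper: submit the monomer script for one fasta file.
# Set DRY_RUN=1 to print the sbatch command instead of running it.
submit_monomer() {
    local fasta="$1"
    if [ ! -f "$fasta" ]; then
        echo "error: no such file: $fasta" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sbatch $SCRIPT $fasta"
        return 0
    fi
    # Log directories from the #SBATCH --output / --error lines must exist.
    mkdir -p "/home/${USER}/slurmfiles_out" "/home/${USER}/slurmfiles_err"
    # Arguments after the script name are passed to the script as $1, $2, ...
    sbatch "$SCRIPT" "$fasta"
}

# Example usage, run from the directory where you want the output written,
# since the script uses --output_dir=$(pwd):
#   DRY_RUN=1 submit_monomer my_protein.fa
#   submit_monomer my_protein.fa
```
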

GPU timings

I tested the speed of different GPU configurations on Blanca using a small multimer system (2 chains, ~140 aa) and found that A40s, followed by A100s, were the fastest. The scripts submitted by alphafold-predict request either A40s or A100s.

Although the code seems to be optimized for 16 CPUs, you may change the number in the sbatch script by editing the BIOKEM_ALPHA_CPUS value.
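
As a minimal sketch of that edit, the helper below changes the CPU count in a personal copy of the sbatch script (you cannot modify the shared one). set_alphafold_cpus is a hypothetical name. Updating the #SBATCH --ntasks line to match is an assumption on my part, so that Slurm allocates the same number of CPUs that alphafold is told to use. It relies on GNU sed (sed -i), which is what Linux clusters normally provide.

```shell
#!/bin/bash
# Hypothetical helper: set the CPU count in a local copy of the sbatch script.
set_alphafold_cpus() {
    local script="$1" cpus="$2"
    # Only accept a non-empty string of digits.
    case "$cpus" in
        ''|*[!0-9]*) echo "error: cpus must be a positive integer" >&2; return 1 ;;
    esac
    sed -i \
        -e "s/^export BIOKEM_ALPHA_CPUS=.*/export BIOKEM_ALPHA_CPUS=${cpus}/" \
        -e "s/^#SBATCH --ntasks=.*/#SBATCH --ntasks=${cpus}/" \
        "$script"
}

# Example usage:
#   cp /projects/biokem/software/biokem/users/example_sbatch_scripts/alphafold/predict_monomer.q my_predict.q
#   set_alphafold_cpus my_predict.q 8
```
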

GPU timings test

Hardware        Run time (s)
CPU 32x         11981
CPU 8x          10513
CPU 16x         7802
P100 2x         2989
T4 1x           2890
P100 1x         2513
V100 2x         2390
V100 1x         2354
Rtx6000 1x      1907
Rtx6000 2x      1895
A100 1x         1851
A100 2x         1785
A40 1x          1670
A40 2x          1670
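
To put these timings in perspective, the short sketch below computes the speedup of a few GPU configurations relative to the fastest CPU run (CPU 16x, 7802 s). The speedup function is a hypothetical helper; the numbers are copied from the timings above.

```shell
#!/bin/bash
# Hypothetical helper: speedup = baseline time / measured time, one decimal place.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.1f\n", base / t }'
}

cpu16=7802   # CPU 16x run time in seconds

echo "A40 1x vs CPU 16x:  $(speedup "$cpu16" 1670)x"
echo "A100 1x vs CPU 16x: $(speedup "$cpu16" 1851)x"
echo "T4 1x vs CPU 16x:   $(speedup "$cpu16" 2890)x"
```

On this test system a single A40 is roughly 4.7 times faster than the best CPU-only run, while adding a second GPU of the same type changes the run time very little.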

Known errors