Alphafold
=========

Alphafold is used to predict the structure of proteins using their sequence alone. It does this through a machine learning approach that takes models of similar sequences into consideration, but mainly relies on how likely two particular residues are to be close to each other in 3D space.

SBGrid packages Alphafold for us (`see here `_)

Running alphafold
-----------------

If you've set up your environment correctly (see :doc:`../configure`), ``alphafold-predict`` should be in your path and will work if you are in a ``biokem-interactive`` session.

#. Log on to OpenOnDemand (see :doc:`../logging_on`)
#. Start an interactive session:

   .. code-block:: bash

      biokem-interactive

#. Navigate to your working directory (usually ``/pl/active/``)
#. Make your :ref:`Input file`
#. Run:

   .. code-block:: bash

      alphafold-predict

Check the status of your job:

.. code-block:: bash

   squeue -u $USER

Check the output of alphafold (it writes to the error file):

**If it was the most recent job you submitted, you can omit the jobid argument.**

.. code-block:: bash

   slurm-err

.. _Input file:

Input file
~~~~~~~~~~

The only input you need to run Alphafold is your sequence(s) of interest in a file in fasta format.

#. Navigate to a directory in your PL and make a new directory for each fold you are attempting (``mkdir ``)
#. Create a fasta file (``touch .fa``)
#. Edit the fasta file using nano, vim, or another editor.

   - Add your fasta header, which must contain a ``>`` followed by your protein name
   - Then add your protein sequence in single-letter amino acid code
   - If you are folding a multimer, add multiple entries to this file (``>`` + name, sequence, ``>`` + name, sequence, etc.)

Example fasta:

.. code-block:: bash

   >T1083
   GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRY
   KEAAEENRALAKLHHELAIVED

Example multimer fasta:

.. code-block:: bash

   >T1083
   GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRY
   KEAAEENRALAKLHHELAIVED
   >T1083
   GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRY
   KEAAEENRALAKLHHELAIVED

.. _Database:

Database
~~~~~~~~

The Alphafold database is over 2TB in size and takes a prohibitively long time to download on RC infrastructure. I have downloaded it and placed it in ``/pl/active/BioKEM/software/alphafolddb`` (you should be able to read this location, but won't be able to update it).

**Don't attempt to download this database on your own; use this one (the script does this for you).**

.. _Alpha submission:

Submission script
~~~~~~~~~~~~~~~~~

This is the sbatch script that is actually being submitted for you: ``/projects/biokem/software/biokem/users/example_sbatch_scripts/alphafold/predict_monomer.q``

(There are a few variations on this script in that folder for multimers and large proteins; ``alphafold-predict`` will submit those for you.)

.. code-block:: bash

   #!/bin/bash
   #SBATCH --partition=blanca
   #SBATCH --qos=preemptable
   #SBATCH --account=blanca-biokem
   #SBATCH --job-name=alphafold_predict
   #SBATCH --nodes=1
   #SBATCH --ntasks=16
   #SBATCH --mem=64gb
   #SBATCH --constraint=A100|A40
   #SBATCH --gres=gpu:1
   #SBATCH --time=24:00:00
   #SBATCH --output=/home/%u/slurmfiles_out/slurm_%j.out
   #SBATCH --error=/home/%u/slurmfiles_err/slurm_%j.err

   #Path to fasta file, needs each monomer as own chain
   FASTA=$1
   echo "Predicting monomer for file: ${FASTA}"

   #Run this inside SBGrid environment
   export BIOKEM_ALPHA_CPUS=16 TMPDIR=$SLURM_SCRATCH PATH=$PATH:/curc/sw/cuda/11.2/bin _PTXAS=userpath
   source /programs/sbgrid.shrc

   #set to Alphafold 2.3.2 (database needs to be updated if changed)
   ALPHAFOLD_X=2.3.2
   DB='/pl/active/BioKEM/software/alphafolddb/'

   /programs/x86_64-linux/alphafold/${ALPHAFOLD_X}/bin.capsules/run_alphafold.py \
       --data_dir=${DB} \
       --output_dir=$(pwd) \
       --fasta_paths=${FASTA} \
       --max_template_date=2020-05-14 \
       --db_preset=full_dbs \
       --bfd_database_path=${DB}bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
       --uniref30_database_path=${DB}uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
       --uniref90_database_path=${DB}uniref90/uniref90.fasta \
       --mgnify_database_path=${DB}mgnify/mgy_clusters_2018_12.fa \
       --template_mmcif_dir=${DB}pdb_mmcif/mmcif_files \
       --obsolete_pdbs_path=${DB}pdb_mmcif/obsolete.dat \
       --use_gpu_relax=True \
       --model_preset=monomer \
       --pdb70_database_path=${DB}pdb70/pdb70

.. _GPU timings:

GPU timings
~~~~~~~~~~~

I tested the speed of different GPU configurations on Blanca using a small multimer system (2 chains, ~140aa) and found that A40s, followed by A100s, were the fastest. The scripts in ``alphafold-predict`` will submit to use either A40s or A100s. Although the code seems to be optimized for 16 CPUs, you may change the number in the sbatch script by editing the ``BIOKEM_ALPHA_CPUS`` value.

.. table:: GPU timings test
   :widths: auto

   ==========  ============
   Hardware    Run time (s)
   ==========  ============
   CPU 32x     11981
   CPU 8x      10513
   CPU 16x     7802
   P100 2x     2989
   T4 1x       2890
   P100 1x     2513
   V100 2x     2390
   V100 1x     2354
   Rtx6000 1x  1907
   Rtx6000 2x  1895
   A100 1x     1851
   A100 2x     1785
   A40 1x      1670
   A40 2x      1670
   ==========  ============

.. _Known errors:

Known errors
------------