User Instructions

Running the Workflow on NSF NCAR HPCs

This section walks a new user through the end-to-end steps required to configure and execute the WPS WRF Workflow on NSF NCAR’s Casper or Derecho HPC Systems. This workflow should also work on any other HPC system that has either a PBS or Slurm queue scheduler.

Prerequisites

Access and Permissions

  • Login credentials for Casper or Derecho (e.g., via ssh username@casper.hpc.ucar.edu)

  • Valid PBS project/account codes (e.g., #PBS -A NWS####)

  • Write permissions on scratch/project directories for workflow outputs and downloaded data

Local Tools

  • git for cloning the repo

  • conda (or module) to activate a Python 3.11 environment

  • PBS commands: qsub, qstat, qdel

Clone the Repository

git clone git@github.com:NCAR/wps_wrf_workflow.git
cd wps_wrf_workflow

Configure the Workflow

Main YAML Configuration

  1. Enter the config directory:

    cd config
    
  2. Open the FastEddy New Mexico config:

    vim config_fasteddy_nm.yaml
    
  3. Key fields to update:

    • template_dir: Path to ../templates/fasteddy_nm in a local clone, so that the workflow picks up the user’s edits to template namelists, job submission scripts, etc.

    • exp_name: Optional. Add/modify this field if you want to give this simulation an experiment name (e.g., if using different IC/LBC models or testing different WRF namelist options). (Default: None)

    • exp_wrf_only: Optional. If exp_name is set but you only want the experiment name to apply to real/WRF while using the same metgrid output (met_em* files) for all experiments (e.g., if only WRF namelist options are being changed), then set this to True. (Default: False.)

    • wps_ins_dir and wrf_ins_dir: Directories where WPS and WRF are installed; defaults can remain if they are read-accessible. (Default values: /glade/u/home/jaredlee/programs/WPS-4.6-dmpar and /glade/u/home/jaredlee/programs/WRF-4.6, respectively.)

      Note

      Currently the workflow assumes WPS is compiled for dmpar (not serial) execution, which speeds up the WPS programs. On NSF NCAR HPCs, WPS should be compiled on Casper and WRF should be compiled on Derecho.

    • wps_run_dir and wrf_run_dir: Scratch or project paths that are write-accessible. The workflow will create subdirectories automatically. NOTE: These must be updated from the defaults in the repo.

    • grib_dir: Location to store downloaded HRRR GRIB2 files. This location must be a path that is write-accessible by the user and should be updated from the default value.

    • sim_hrs: Total simulation length in hours (e.g., 30). This value drives how most subsequent steps run. (Default value: 24.)

    • do_geogrid: Set to True to run geogrid; if geogrid outputs are already available (e.g., from a colleague), set to False. Ensure opt_output_from_geogrid_path in namelist.wps template(s) points to correct geogrid output location. (Default value: False.)

    • ungrib_domain: full ungribs the complete GRIB file domain (CONUS for HRRR; global for GFS and GEFS). subset is currently available only for GEFS; it geographically subsets the GEFS output to a smaller domain so that ungrib.exe runs faster, which may be helpful in operational configurations. (This option could be added in the future for GFS or other IC/LBC model sources.) (Default value: full.)

    • icbc_model and icbc_source:

      • icbc_model: e.g., hrrr, gfs, gfs-fnl, or gefs. Entries may be all-lowercase or all-caps. (Default value: GFS.)

      • icbc_source: AWS or GoogleCloud to download from AWS or GoogleCloud HRRR repositories, or GLADE to link to local RDA archive files (GLADE/RDA repositories are available for gfs and gfs-fnl but not for hrrr or gefs). Entries for AWS and GoogleCloud may be all-caps or all-lowercase; GoogleCloud may also be written in camel case. (Default value: GLADE.)

    • icbc_analysis: True to initialize from the analysis (forecast hour 0) files from successive initial condition/lateral boundary condition (IC/LBC) model cycles (NOTE: this can only be done retrospectively, and should provide the best-possible ICs/LBCs). False to initialize from forecast files from a single IC/LBC model cycle. (Default value: False.)

    • icbc_fc_dt: Normally set to 0. If set to some other positive integer N, then ICs/LBCs are obtained from an N-hour old cycle of the icbc_model. (That situation may be useful to stay ahead of the clock in operational forecast systems.) (Default value: 0.)

    • get_icbc: True to download or create symbolic links to model data to use as ICs/LBCs. If the specified files already exist locally where expected, then nothing is re-downloaded or re-linked. (Default value: False.)

    • hrrr_native: True to download the HRRR native (hybrid)-level files for atmospheric variables and pressure-level files for soil variables only (HRRR native-level files have more vertical levels [51] than pressure-level files [40] but do not include soil variables). False to download only the HRRR pressure-level files for both atmospheric and soil variables. Has no effect if icbc_model is set to something other than hrrr. (Default value: True.)

    • do_geogrid, do_ungrib, do_metgrid, do_real, do_wrf: Control each WPS/WRF step. Geogrid is a one-time domain setup; Ungrib/Metgrid/Real need to be run for each WRF forecast cycle (and potentially also for different WRF configurations for the same WRF cycle, depending on what is different). (Default values: False for all.)

    • do_avg_tsfc: If True, runs a WPS utility (avg_tsfc.exe) that calculates a 24-hour average surface temperature to better estimate lake-surface temperatures (avoiding interpolation from oceans, which can result in wildly inaccurate surface temperatures for inland lakes). This step is run after Ungrib but before Metgrid. If the output file (TAVGSFC) has already been generated by a previous run of the workflow for this WRF cycle/experiment, then set this to False to save a few minutes. (Default value: False.)

    • use_tavgsfc: True to use the output from the avg_tsfc.exe utility (a file called TAVGSFC) in Metgrid. This will add the appropriate line to the &metgrid section of namelist.wps if it does not already exist. (Default value: False.)

    • archive: When True, the workflow automatically moves all output (namelists, wrfout*, logs) into an archival directory for easy retrieval; set arc_dir to a write-accessible directory. (Default value: False.)

    • realtime: When True while do_wrf = True, the workflow holds after submitting WRF and monitors the WRF job, and does not move on to the next workflow step or model cycle until WRF completes successfully. If realtime = False while do_wrf = True, the workflow moves to the next step or model cycle immediately after submitting WRF to the queue, without monitoring its status, progress, or success. If do_wrf = False, realtime has no effect. (Default value: False.)

    • account: When provided, this HPC account key will be used instead of the account key that is provided in the submit_*.bash scripts in the templates directory. If not provided, then the account key line is not overwritten.

All other fields can remain at their default values unless specialized cases arise.
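
The defaults and constraints listed above can be summarized in a small sanity check. The following is only a sketch: apply_defaults and check_config are hypothetical helper names (they are not part of the workflow), the config is assumed to have already been parsed from YAML into a plain dict, and the rules encoded here are limited to the ones stated in the field descriptions above.

```python
# Hypothetical helpers; default values are the ones documented in the list above.
DOCUMENTED_DEFAULTS = {
    "exp_name": None,
    "exp_wrf_only": False,
    "sim_hrs": 24,
    "ungrib_domain": "full",
    "icbc_model": "GFS",
    "icbc_source": "GLADE",
    "icbc_analysis": False,
    "icbc_fc_dt": 0,
    "get_icbc": False,
    "hrrr_native": True,
    "do_geogrid": False,
    "do_ungrib": False,
    "do_avg_tsfc": False,
    "do_metgrid": False,
    "do_real": False,
    "do_wrf": False,
    "use_tavgsfc": False,
    "archive": False,
    "realtime": False,
}


def apply_defaults(user_cfg):
    """Return a new dict with the documented defaults filled in for missing keys."""
    cfg = dict(DOCUMENTED_DEFAULTS)
    cfg.update(user_cfg)
    return cfg


def check_config(cfg):
    """Return a list of human-readable problems found in the merged config."""
    problems = []
    # Model and source names are compared case-insensitively, as described above.
    model = str(cfg["icbc_model"]).lower()
    source = str(cfg["icbc_source"]).lower()
    if source == "glade" and model in ("hrrr", "gefs"):
        problems.append(f"icbc_source GLADE is not available for {model}")
    if str(cfg["ungrib_domain"]).lower() == "subset" and model != "gefs":
        problems.append("ungrib_domain subset is only available for gefs")
    if cfg["archive"] and not cfg.get("arc_dir"):
        problems.append("archive is True but arc_dir is not set")
    if int(cfg["icbc_fc_dt"]) < 0:
        problems.append("icbc_fc_dt must be 0 or a positive integer")
    return problems


cfg = apply_defaults({"icbc_model": "hrrr", "sim_hrs": 30})
for problem in check_config(cfg):
    print("Config problem:", problem)
```

In this example the user switched icbc_model to hrrr but left icbc_source at its GLADE default, so the check reports that combination as unsupported.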

Using Time-Varying (GEOS-5) or Climatological Aerosols with mp_physics=28

This section provides brief instructions for users who wish to use the Thompson-Eidhammer aerosol-aware microphysics scheme in WRF (mp_physics=28), whether with climatological aerosols or with time-varying aerosols from GEOS-5. These instructions presume the user is using WRF 4.4+, which is when the GEOS-5 and black carbon aerosol capabilities were added. They are intended to complement the information already posted at https://www2.mmm.ucar.edu/wrf/users/physics/mp28_updated_new.html and at https://github.com/wrf-model/WRF/pull/1616.

1. Thompson-Eidhammer climatological aerosol

The workflow does not require any special variables to be set in the config yaml file. The only settings to choose this option would come in the template WPS & WRF namelists.

  • In namelist.wps, set this in the template namelist (assuming Derecho/Casper):

&metgrid
  constants_name = '/glade/work/wrfhelp/WPS_files/QNWFA_QNIFA_QNBCA_SIGMA_MONTHLY.dat',

  • In namelist.input, ensure you have these settings in the template namelist:

&time_control
  [any options for auxinput15 or auxinput17 should be commented or removed]

&domains
  wif_input_opt     = 2,
  num_wif_levels    = 30,

&physics
  mp_physics        = 28, 28, 28,
  aer_opt           = 3,
  use_aero_icbc     = .true.,
  use_rap_aero_icbc = .false.,
  wif_fire_emit     = .false.,
  qna_update        = 0,

2. GEOS-5 time-varying aerosols

Using this option presumes that the geos2wrf software has already been used to process GEOS-5 global files (which NASA provides in NetCDF) into regional files in WPS Intermediate Format, containing the variables that Metgrid, Real, and WRF expect, as detailed in https://github.com/wrf-model/WRF/pull/1616. A link to the geos2wrf GitHub repository will be posted here after that repo becomes publicly available.

There are some settings in the config yaml file that are required, but first, these are the settings in the template WPS & WRF namelists that should be set prior to running the workflow:

  • In namelist.wps, set something like this in the template namelist:

&metgrid
  fg_name = '/glade/derecho/scratch/jaredlee/ipc/wps/20170301_00/ungrib/GFS_FNL', '/glade/derecho/scratch/jaredlee/ipc/wps/20170301_00/ungrib/GEOS',

Note the user should not supply QNWFA_QNIFA_QNBCA_SIGMA_MONTHLY.dat in constants_name when using GEOS-5 aerosols, as that file is for the aerosol climatology.

  • In namelist.input, ensure you have these settings in the template namelist:

&time_control
  [any options for auxinput15 commented or deleted]
  auxinput17_inname     = "wrfqnainp_d0*",
  auxinput17_interval_m = 180, 180, 180,
  io_form_auxinput17    = 2,

&domains
  wif_input_opt     = 2,
  num_wif_levels    = 72,

&physics
  mp_physics        = 28, 28, 28,
  aer_opt           = 3,
  use_aero_icbc     = .false.,
  use_rap_aero_icbc = .true.,
  wif_fire_emit     = .true.,
  qna_update        = 1,

  • In addition, these options in the config yaml file need to be considered:

    • use_geos5_aero_fcst: True to use time-varying aerosol data from GEOS-5 forecasts in Metgrid, for use with the Thompson-Eidhammer aerosol-aware microphysics scheme in WRF v4.4+ (mp_physics=28). These GEOS-5 data files must already be in WPS Intermediate Format. This option will add the appropriate line/entry to the &metgrid section of namelist.wps if it does not already exist. (Default value: False.)

    • use_geos5_aero_anal: True to use time-varying aerosol data from GEOS-5 analyses in Metgrid, for use with the Thompson-Eidhammer aerosol-aware microphysics scheme in WRF v4.4+ (mp_physics=28). These GEOS-5 data files must already be in WPS Intermediate Format. This option will add the appropriate line/entry to the &metgrid section of namelist.wps if it does not already exist. (Default value: False.)

    • geos5_int_dir: String specifying the path of the parent directory where the GEOS-5 time-varying aerosol data in WPS Intermediate Format are stored. This option is only used if use_geos5_aero_fcst or use_geos5_aero_anal is set to True. (Default value: None.)
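
Because the two aerosol modes differ in several namelist.input settings, it is easy to leave a template in one mode while the config requests the other. The sketch below compares a set of namelist values against the settings shown in the two examples above. It is illustrative only: aerosol_mode and mismatched_settings are hypothetical names, the namelist is assumed to have already been parsed into a dict (with .true./.false. mapped to Python booleans), and "climatological" is simply assumed whenever neither GEOS-5 switch is set.

```python
# Settings copied from the two namelist.input examples above
# (.true./.false. are represented as Python booleans).
AEROSOL_SETTINGS = {
    "climatological": {
        "wif_input_opt": 2,
        "num_wif_levels": 30,
        "aer_opt": 3,
        "use_aero_icbc": True,
        "use_rap_aero_icbc": False,
        "wif_fire_emit": False,
        "qna_update": 0,
    },
    "geos5": {
        "wif_input_opt": 2,
        "num_wif_levels": 72,
        "aer_opt": 3,
        "use_aero_icbc": False,
        "use_rap_aero_icbc": True,
        "wif_fire_emit": True,
        "qna_update": 1,
    },
}


def aerosol_mode(cfg):
    """Pick the namelist profile implied by the two GEOS-5 config switches."""
    uses_geos5 = cfg.get("use_geos5_aero_fcst", False) or cfg.get("use_geos5_aero_anal", False)
    return "geos5" if uses_geos5 else "climatological"


def mismatched_settings(namelist_values, mode):
    """Return {name: (found, expected)} for every setting that differs from the profile."""
    expected = AEROSOL_SETTINGS[mode]
    return {
        name: (namelist_values.get(name), want)
        for name, want in expected.items()
        if namelist_values.get(name) != want
    }


# Example: the config asks for GEOS-5 forecasts, but the template is still in climatology mode.
cfg = {"use_geos5_aero_fcst": True, "geos5_int_dir": "/path/to/geos5_intermediate"}
template_values = dict(AEROSOL_SETTINGS["climatological"])
for name, (found, want) in mismatched_settings(template_values, aerosol_mode(cfg)).items():
    print(f"{name}: template has {found}, GEOS-5 setup expects {want}")
```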

Edit Template Files

  1. Move into the FastEddy template directory:

    cd ../templates/fasteddy_nm
    
  2. Update Account in Submit Scripts:

    • Open each PBS script (e.g., submit_geogrid.bash.casper, submit_ungrib.bash.casper, etc.) and specify the desired user account to charge for core hours:

    #PBS -A <user_account_code>
    
    • The user may also wish to adjust the number of nodes and cores per node requested in some of these submit scripts based on runtime, core hour charges, etc.:

    #PBS -l select=<# of nodes>:ncpus=<# of CPUs per node>:mpiprocs=<# of MPI processes per node>
    [snip]
    mpiexec -n <# of nodes * MPI processes per node> ./wrf.exe
    
  3. Modify namelist.wps.hrrr:

    • opt_output_from_geogrid_path:

    opt_output_from_geogrid_path = "/path/to/geogrid_output"
    
    • &ungrib section:

    prefix = "/path/to/ungrib_output/<CYCLE>/ungrib/HRRR"
    
    • Note: Workflow will create .../ungrib_output/<CYCLE>/hybrid and .../<CYCLE>/soil subdirectories automatically if needed.

    • &metgrid section:

      fg_name = "/path/to/ungrib_output/<CYCLE>/ungrib/HRRR_hybr", "/path/to/ungrib_output/<CYCLE>/ungrib/HRRR_soil",
      opt_output_from_metgrid_path = "/path/to/metgrid_output/<CYCLE>/metgrid"
      
    • If using your own WPS installation, then the user should also update these variables:

      opt_geogrid_tbl_path = '/path/to/WPS_install/geogrid',
      opt_metgrid_tbl_path = '/path/to/WPS_install/metgrid',
      

The directories specified above must be write-accessible; the control script creates them as needed (equivalent to mkdir -p) and replaces <CYCLE> in these namelist variables automatically.
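
The placeholder substitution and directory creation described above can be pictured with the following sketch. It is not the workflow's actual code: fill_cycle is a hypothetical name, the template line is a shortened stand-in for a real namelist.wps entry, and a temporary directory is used in place of a real scratch path.

```python
import os
import tempfile


def fill_cycle(text, cycle):
    """Replace every <CYCLE> placeholder in the text with the cycle string."""
    return text.replace("<CYCLE>", cycle)


# Stand-in for a write-accessible scratch location (assumption for this example).
root = tempfile.mkdtemp()

template_line = f'opt_output_from_metgrid_path = "{root}/metgrid_output/<CYCLE>/metgrid"'
filled = fill_cycle(template_line, "20250324_00")

# Pull the quoted path back out and create it, like mkdir -p would.
out_dir = filled.split('"')[1]
os.makedirs(out_dir, exist_ok=True)

print(filled)
```

Using exist_ok=True makes the directory creation safe to repeat when the workflow is rerun for the same cycle.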

Python Environment Setup

  1. Activate Python 3.11:

    conda activate /glade/work/jaredlee/conda-envs/my-npl-202403
    
  2. Verify dependencies:

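    # If building your own environment, create it from environment.yml (a conda file, not a pip requirements file):
    conda env create -f environment.yml
    # or ensure 'yaml', 'netCDF4', 'numpy', 'pandas', etc., import without errors:
    python -c "import yaml, netCDF4, numpy, pandas"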
    
  3. Dependencies are declared in environment.yml, which is based on NSF NCAR’s NPL 2024a stack plus extras.

Running the Workflow

From the repository root:

# Display usage/help
python setup_wps_wrf.py -h

# Execute workflow for one cycle
python setup_wps_wrf.py \
-b 20250324_00 \
-c config/config_fasteddy_nm.yaml

  • -b YYYYMMDD_HH: Start cycle (e.g., 20250324_00)

  • -c: Workflow config YAML path (can be a relative path from setup_wps_wrf.py)

Automatic Directory Creation: The Python scripts will create all parent directories for geogrid, ungrib, metgrid, etc., based on the configured paths.
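
To make the -b format concrete, here is a minimal sketch of how a YYYYMMDD_HH cycle string can be parsed and combined with sim_hrs to get the simulation end time. Only the -b and -c options come from the usage shown above; the parser itself, the dest names, and the hard-coded 30-hour length are assumptions for illustration, and the real script's argument handling may differ.

```python
import argparse
from datetime import datetime, timedelta


def parse_cycle(text):
    """Convert a YYYYMMDD_HH string into a datetime (raises ValueError if malformed)."""
    return datetime.strptime(text, "%Y%m%d_%H")


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the workflow's command-line options")
    parser.add_argument("-b", dest="cycle", type=parse_cycle, required=True,
                        help="Start cycle as YYYYMMDD_HH")
    parser.add_argument("-c", dest="config", required=True,
                        help="Path to the workflow config YAML file")
    return parser


# Parse the same arguments used in the example command above.
args = build_parser().parse_args(["-b", "20250324_00", "-c", "config/config_fasteddy_nm.yaml"])

sim_hrs = 30  # example value; normally read from the config file
end_time = args.cycle + timedelta(hours=sim_hrs)

print("Start:", args.cycle.strftime("%Y-%m-%d_%H:%M:%S"))
print("End:  ", end_time.strftime("%Y-%m-%d_%H:%M:%S"))
```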

Workflow Execution Details

For running geogrid.exe, ungrib.exe, metgrid.exe, real.exe, and wrf.exe, batch job submission scripts are needed to submit them to the HPC queue. If running on a non-NSF NCAR HPC system, users will need the following submission script files in template_dir:

  • submit_geogrid.bash

  • submit_ungrib.bash

  • submit_metgrid.bash

  • submit_real.bash

  • submit_wrf.bash

However, on NSF NCAR HPCs (Casper and/or Derecho), WPS needs to be compiled on Casper (whose queue allows single-core jobs without reserving an entire node), while WRF needs to be compiled on Derecho (whose queues require reserving an entire 128-core node even if only 1 core is used). Set wps_ins_dir and wrf_ins_dir to point to those installation directories. Both Derecho and Casper allow peer scheduling to queues on either machine from either machine (see the NSF NCAR documentation on peer scheduling between systems for more information).

To enable transparent-to-the-user execution of the entire workflow from a login node on either Casper or Derecho, two sets of files are needed. If executing the workflow on Casper, these files need to be in template_dir, with the submit_real and submit_wrf scripts including the required syntax to submit to a queue on Derecho from Casper:

  • submit_geogrid.bash.casper

  • submit_ungrib.bash.casper

  • submit_metgrid.bash.casper

  • submit_real.bash.casper

  • submit_wrf.bash.casper

If executing the workflow on Derecho, then these files need to be in template_dir, with the submit_geogrid, submit_ungrib, and submit_metgrid scripts including the required syntax to submit to a queue on Casper from Derecho:

  • submit_geogrid.bash.derecho

  • submit_ungrib.bash.derecho

  • submit_metgrid.bash.derecho

  • submit_real.bash.derecho

  • submit_wrf.bash.derecho

The workflow will automatically copy the appropriate submission script template to the run directories and strip the .casper or .derecho file name suffix if they exist.
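
The copy-and-strip behavior described above could look roughly like the sketch below. The function names are hypothetical, and the lookup order (host-specific file first, then the plain submit_<step>.bash) is an assumption based on the description rather than a statement of how the workflow is implemented.

```python
import os
import shutil
import tempfile


def pick_submit_template(template_dir, step, host):
    """Prefer submit_<step>.bash.<host>; fall back to submit_<step>.bash."""
    base = f"submit_{step}.bash"
    for name in (f"{base}.{host}", base):
        path = os.path.join(template_dir, name)
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"No {base} template found in {template_dir}")


def stage_submit_script(template_dir, run_dir, step, host):
    """Copy the chosen template into run_dir under the suffix-free name."""
    src = pick_submit_template(template_dir, step, host)
    os.makedirs(run_dir, exist_ok=True)
    dst = os.path.join(run_dir, f"submit_{step}.bash")
    shutil.copy(src, dst)
    return dst


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    template_dir = os.path.join(tmp, "templates")
    os.makedirs(template_dir)
    for name in ("submit_wrf.bash.casper", "submit_wrf.bash.derecho"):
        with open(os.path.join(template_dir, name), "w") as f:
            f.write(f"# placeholder for {name}\n")
    staged = stage_submit_script(template_dir, os.path.join(tmp, "wrf"), "wrf", "derecho")
    print(os.path.basename(staged))
```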

Additionally, note that the namelist templates in template_dir must have suffixes corresponding to icbc_model, which enables WRF experiments that use different models for the ICs/LBCs. Each model’s output typically has a different number of soil or atmospheric levels, which requires different values for certain namelist settings. Keeping one template per model avoids over-complicating the workflow scripts with many if/then branches for model-specific namelist changes, which could be further complicated by future updates to those external models, their number of output levels, or other key parameters. For example, to be able to run WRF driven by GFS, GFS-FNL, or HRRR output, the user would need these files in template_dir:

  • namelist.input.gfs

  • namelist.input.gfs-fnl

  • namelist.input.hrrr

  • namelist.wps.gfs

  • namelist.wps.gfs-fnl

  • namelist.wps.hrrr

Note that users only need to have the template files corresponding to the desired icbc_model variants that they would like to be available to use.

If users use HRRR model output as ICs/LBCs for WRF, note that the number of vertical levels is different in the native (hybrid)-level output (51) than in the pressure-level output (40). Therefore, if users want the flexibility to run with either native/hybrid or pressure-level HRRR output, then two different template WRF namelists in template_dir are needed:

  • namelist.input.hrrr.hybr

  • namelist.input.hrrr.pres

If users do not have either of these files, the workflow defaults to using namelist.input.hrrr, which then may cause an error when real.exe is run if the wrong value for num_metgrid_levels is specified in namelist.input.hrrr for the type of HRRR output.

Note that if the user intends to run with only hybrid-level or only pressure-level HRRR output, then only namelist.input.hrrr needs to be present; just ensure that num_metgrid_levels in namelist.input.hrrr is set correctly for that type of output.
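
The template lookup described in the preceding paragraphs (model-specific suffix, with an extra .hybr/.pres level for HRRR and a fallback to namelist.input.hrrr) can be sketched as follows. pick_namelist_input is a hypothetical name and this is only one plausible reading of the fallback order described above.

```python
import os
import tempfile


def pick_namelist_input(template_dir, icbc_model, hrrr_native=True):
    """Return the WRF namelist template path for the given IC/LBC model."""
    model = icbc_model.lower()
    candidates = [f"namelist.input.{model}"]
    if model == "hrrr":
        # Native (hybrid)-level and pressure-level HRRR output have different level counts.
        level_type = "hybr" if hrrr_native else "pres"
        candidates.insert(0, f"namelist.input.hrrr.{level_type}")
    for name in candidates:
        path = os.path.join(template_dir, name)
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"None of {candidates} found in {template_dir}")


# Demonstration with empty placeholder templates.
with tempfile.TemporaryDirectory() as template_dir:
    for name in ("namelist.input.gfs", "namelist.input.hrrr", "namelist.input.hrrr.hybr"):
        open(os.path.join(template_dir, name), "w").close()
    for model, native in (("GFS", True), ("hrrr", True), ("hrrr", False)):
        chosen = pick_namelist_input(template_dir, model, hrrr_native=native)
        print(model, native, "->", os.path.basename(chosen))
```

In the demonstration, the pressure-level HRRR case falls back to namelist.input.hrrr because no namelist.input.hrrr.pres file exists, which is exactly the situation where num_metgrid_levels needs to be double-checked.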

Also note that for the WPS and WRF namelists, this workflow does NOT generate grid/domain information from scratch or from any user inputs. The user is required to specify the grid/domain details in advance in these namelist template files. If the expected template namelist files do not exist prior to running the workflow, then the workflow will fail. Other tools already exist for setting grid/domain configurations for WPS and WRF namelists, such as WRF Domain Wizard. Future updates to the workflow may add the capability to specify domain configuration details in a YAML file to automatically update the WPS and WRF namelists.

One final note: If the user desires to control which variables are written out to history streams, then there should also be a file (or multiple file names separated by commas, which could be the same or unique for each domain) set by the user in the &time_control section of namelist.input.{icbc_model}, such as:

iofields_filename = "vars_io.txt",

Any files listed on that line should be stored in template_dir. If any requested files are not found in template_dir, the workflow will log a warning, and WRF will still run, but then the default output variables for the specified stream(s) in the file will be written out for that domain. For more information on this file and its required syntax, see the WRF Model README.io_config file.
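
As an illustration of the check described above, the sketch below extracts the file names from an iofields_filename line and warns about any that are missing from template_dir. The function names and the regular expressions are assumptions made for this example; the workflow's own parsing may differ.

```python
import logging
import os
import re
import tempfile


def iofields_files(namelist_text):
    """Return the unique file names listed on the iofields_filename line, in order."""
    match = re.search(r"^\s*iofields_filename\s*=\s*(.*)$", namelist_text, re.MULTILINE)
    if match is None:
        return []
    names = []
    # Each domain has its own quoted entry; an entry may itself hold comma-separated names.
    for quoted in re.findall(r"""["']([^"']*)["']""", match.group(1)):
        for name in quoted.split(","):
            name = name.strip()
            if name and name not in names:
                names.append(name)
    return names


def missing_iofields(namelist_text, template_dir):
    """Log a warning for each listed file that is absent from template_dir; return them."""
    missing = [name for name in iofields_files(namelist_text)
               if not os.path.isfile(os.path.join(template_dir, name))]
    for name in missing:
        logging.warning("iofields file %s not found in %s", name, template_dir)
    return missing


namelist_text = """&time_control
  iofields_filename = "vars_io_d01.txt", "vars_io_d02.txt", "vars_io_d02.txt",
/
"""

with tempfile.TemporaryDirectory() as template_dir:
    open(os.path.join(template_dir, "vars_io_d01.txt"), "w").close()
    print(missing_iofields(namelist_text, template_dir))
```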

  1. ICBC Download/Link:

    • Downloads IC/LBC files from a web server or links to them in a local repository. For example, if icbc_model = hrrr and hrrr_native = True, then download_hrrr_from_aws.py downloads the HRRR native-level files (hrrr.YYYYMMDD/conus/hrrr.tHHz.wrfnatf00.grib2) and then the pressure-level (wrfprs) files for each hour in the requested simulation.

    • Skips download/linking if files already exist locally (useful for repeated runs).

  2. Ungrib:

    • Ungrib is inherently serial; the workflow subdivides it per hour and runs 2×N jobs (hybrid and soil, if using HRRR native-grid files) or N jobs (for all other IC/LBC models) to make it embarrassingly parallel. Ungrib is run separately as a 1-core job for each icbc_model file in its own directory to avoid ungrib.exe cleanup processes that delete all files matching a starting pattern, which often causes “file not found” errors when running multiple instances of ungrib.exe simultaneously within the same directory.

    • Includes a short sleep (1–3 s) between qsub calls to avoid overloading the PBS queue.

    • WPS intermediate-format files (e.g., YYYYMMDD_HH/ungrib/HRRR_hybrid*, HRRR_soil*) are moved into a combined ungrib/ directory once complete.

  3. Geogrid:

    • Domain setup; runs once per domain. Subsequent simulations using the same domain can skip by setting do_geogrid: False.

  4. avg_tsfc:

    • Calculates a 24-h average surface temperature field that Metgrid can use to improve surface temperatures over inland lakes. By default, times outside complete 24-h periods are ignored.

  5. Metgrid:

    • Uses ungrib (and optionally avg_tsfc) outputs to produce NetCDF files on the WRF horizontal grid but on the vertical levels from the ungribbed WPS intermediate format file.

  6. Real:

    • Takes output from metgrid (met_em_d0* files) and puts it onto the full 3D WRF grid to generate initial-time (wrfinput_d0*), lateral boundary condition (wrfbdy_d01), and (optionally) lower boundary condition (wrflowinp_d0*) files that span the requested simulation time.

    • Submits via qsub submit_real.bash; monitors job status.

    • Logs for every processor executing real.exe will appear in rsl.out.* and rsl.error.* files. Note that WRF writes logs to the same file names, so these will be overwritten unless moved elsewhere.

  7. WRF:

    • Submits WRF model via qsub submit_wrf.bash; monitors job status.

    • If a user types CTRL+C, WRF continues running on the compute nodes; logs and wrfout* files appear in the wrf/ subdirectory. Otherwise, the workflow will monitor the WRF simulation’s progress, and only exit upon finding an error or success message in the log files. A future update will clarify how to move on to the next WPS/WRF cycle after submitting WRF, without waiting to monitor the WRF job.
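
To tie steps 1 and 2 to the icbc_analysis and icbc_fc_dt settings, the sketch below lists which HRRR native-level files would be needed and how many Ungrib jobs that implies. It is a reading of the descriptions in this guide, not the workflow's code: the function names are hypothetical, hourly IC/LBC files including the initial time are assumed (so N = sim_hrs + 1), and with icbc_fc_dt = N the forecast hours are assumed to be offset by N from the older cycle.

```python
from datetime import datetime, timedelta


def icbc_times(cycle, sim_hrs, icbc_analysis=False, icbc_fc_dt=0):
    """Return (model_cycle, forecast_hour) pairs, one per hourly valid time."""
    pairs = []
    for hour in range(sim_hrs + 1):
        valid = cycle + timedelta(hours=hour)
        if icbc_analysis:
            # Analysis (forecast hour 0) from each successive model cycle.
            pairs.append((valid, 0))
        else:
            # Forecast files from a single cycle that is icbc_fc_dt hours old.
            pairs.append((cycle - timedelta(hours=icbc_fc_dt), hour + icbc_fc_dt))
    return pairs


def hrrr_key(model_cycle, fhr, kind="wrfnat"):
    """Build the relative HRRR file path; use kind='wrfprs' for pressure-level files."""
    return f"hrrr.{model_cycle:%Y%m%d}/conus/hrrr.t{model_cycle:%H}z.{kind}f{fhr:02d}.grib2"


def ungrib_job_count(n_files, hrrr_native):
    """2 x N jobs (hybrid and soil) for HRRR native-level input, otherwise N jobs."""
    return 2 * n_files if hrrr_native else n_files


cycle = datetime(2025, 3, 24, 0)
pairs = icbc_times(cycle, sim_hrs=2, icbc_fc_dt=6)
for model_cycle, fhr in pairs:
    print(hrrr_key(model_cycle, fhr))
print("Ungrib jobs:", ungrib_job_count(len(pairs), hrrr_native=True))
```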

Monitoring and Troubleshooting

  • Log Locations: Each step (geogrid/, ungrib/, metgrid/, real/, wrf/) has its own *.log files (or rsl.* files for real.exe and wrf.exe). Currently, the workflow scripts only look for key phrases that indicate success or failure of the job, and do not analyze the error messages to provide hints about what might be wrong. Future enhancements to the workflow could include such hints. The WRF & MPAS-A Forum is a useful resource to consult for WPS & WRF troubleshooting issues.

  • Inspecting Jobs:

    qstat -u $USER       # List your queued and running PBS jobs
    tail -f wrf/logs/metgrid.log  # Follow metgrid progress
    
  • Common Errors:

    • Error in ext_pkg_open_for_write_begin: Write-permission error on output path - verify wps_run_dir and template prefixes.

    • Missing Python modules: Ensure the Python 3.11 environment with required packages has been activated.

    • Slurm vs. PBS scripts: If a warning notes that check_job_status.sh references Slurm, it can be ignored, or the script can be updated for PBS compatibility.
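
The key-phrase approach mentioned under Log Locations can be approximated when inspecting rsl files by hand. The sketch below is not the workflow's implementation, and the exact phrases the workflow searches for are not listed in this guide; the phrases used here are the standard messages that real.exe and wrf.exe print on success and on a fatal error, and classify_log is a hypothetical name.

```python
# Standard WRF/real.exe messages; the workflow's own phrase list may differ.
SUCCESS_PHRASES = ("SUCCESS COMPLETE WRF", "SUCCESS COMPLETE REAL_EM INIT")
FAILURE_PHRASES = ("FATAL CALLED",)


def classify_log(text):
    """Return 'failure', 'success', or 'running' based on key phrases in the log text."""
    if any(phrase in text for phrase in FAILURE_PHRASES):
        return "failure"
    if any(phrase in text for phrase in SUCCESS_PHRASES):
        return "success"
    return "running"


examples = {
    "finished": "d01 2025-03-25_06:00:00 wrf: SUCCESS COMPLETE WRF\n",
    "crashed": "-------------- FATAL CALLED ---------------\n",
    "in progress": "Timing for main: time 2025-03-24_00:01:00 on domain   1\n",
}
for label, text in examples.items():
    print(f"{label}: {classify_log(text)}")
```

Failure is checked before success so that a log containing both kinds of message is reported as a failure.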

Reviewing Output

  • Data Directory: Raw GRIB2 files for ICs/LBCs (e.g., for HRRR, data/hrrr/hrrr.YYYYMMDD/conus/).

  • Workflow Directory:

    • ungrib/, geogrid/, metgrid/ subfolders within wps_run_dir/YYYYMMDD_HH/

    • Log files, wrfinput*, wrfbdy, and wrfout* files within wrf_run_dir/YYYYMMDD_HH/

  • Archive: If archive: True, all run artifacts move to arc_dir/YYYYMMDD_HH/ upon completion.