smpl.envs package

Subpackages

Submodules

smpl.envs.atropineenv module

AtropineEnv simulates an atropine production environment.

class smpl.envs.atropineenv.AtropineEnvGym(dense_reward=True, normalize=True, debug_mode=False, action_dim=4, reward_function=None, done_calculator=None, observation_name=None, action_name=None, np_dtype=<class 'numpy.float32'>, max_steps=60, error_reward=-100000.0, x0_loc='https://raw.githubusercontent.com/smpl-env/smpl/main/smpl/configdata/atropineenv/x0.txt', z0_loc='https://raw.githubusercontent.com/smpl-env/smpl/main/smpl/configdata/atropineenv/z0.txt', model_loc='https://github.com/smpl-env/smpl-experiments/blob/main/configdata/atropineenv/model.npy?raw=true', uss_subtracted=True, reward_on_ess_subtracted=False, reward_on_steady=True, reward_on_absolute_efactor=False, reward_on_actions_penalty=0.0, reward_on_reject_actions=True, reward_scaler=1.0, relaxed_max_min_actions=False, observation_include_t=True, observation_include_action=False, observation_include_uss=True, observation_include_ess=True, observation_include_e=True, observation_include_kf=True, observation_include_z=True, observation_include_x=False)[source]

Bases: smpl.envs.utils.smplEnvBase

done_calculator_standard(current_observation, step_count, reward, done=None, done_info=None)[source]
Checks whether the current episode is considered finished.

Returns a boolean indicating whether the episode is done, together with a dictionary of details. In done_calculator_standard, done_info looks like {“terminal”: boolean, “timeout”: boolean}. “timeout” is true when the episode ends because it reached the maximum episode length. “terminal” is true when “timeout” is true or when the episode ends because of a termination condition, such as an environment error, so it is essentially the same as done.

Parameters
  • current_observation ([np.ndarray]) – The denormalized observation.

  • step_count ([int]) – The current step count.

  • reward ([float]) – The reward of the current step.

  • done ([bool], optional) – Defaults to None.

  • done_info ([dict], optional) – How the episode finished. Defaults to None.

Returns

done and done_info.

Return type

[(bool, dict)]
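As a rough illustration of the done_info convention described above, the following sketch builds the dictionary from a step count and an error flag. It is not the library's implementation: the function name sketch_done_info, the error_occurred argument, and the >= comparison against max_steps are assumptions made for this example.

```python
def sketch_done_info(step_count, max_steps, error_occurred=False):
    """Hypothetical helper mirroring the documented done_info layout."""
    # Assumption: the episode times out once step_count reaches max_steps.
    timeout = step_count >= max_steps
    # "terminal" is true on timeout or on any other termination condition.
    terminal = timeout or error_occurred
    done_info = {"terminal": terminal, "timeout": timeout}
    return terminal, done_info


done, info = sketch_done_info(step_count=60, max_steps=60)
early_done, early_info = sketch_done_info(step_count=3, max_steps=60)
```

Here the first call ends by timeout, while the second episode is still running.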

evaluate_rewards_mean_std_over_episodes(algorithms, num_episodes=1, error_reward=None, initial_states=None, to_plt=True, plot_dir='./plt_results', computer_on_episodes=False)[source]

Returns the mean and std of rewards over all episodes. Because rewards_list is ragged (some trajectories are shorter than others), it cannot be converted directly to a numpy array; the nested list has to be converted and unwrapped first. If computer_on_episodes is true, the rewards are first averaged within each episode, and the mean and std are then computed from those averages. Otherwise, the mean and std are computed directly over the rewards of every step.
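The difference between the two modes can be sketched with the standard library alone. This is an illustration under assumptions, not the package's code: sketch_mean_std is a hypothetical name, plain lists stand in for numpy data, and the population standard deviation is assumed.

```python
import statistics


def sketch_mean_std(rewards_list, computer_on_episodes=False):
    """rewards_list holds one list of step rewards per episode (lengths may differ)."""
    if computer_on_episodes:
        # Average within each episode first, then summarize the episode averages.
        values = [statistics.fmean(episode) for episode in rewards_list]
    else:
        # Unwrap the nested list and summarize every step reward directly.
        values = [reward for episode in rewards_list for reward in episode]
    # Assumption: population standard deviation (no Bessel correction).
    return statistics.fmean(values), statistics.pstdev(values)


rewards = [[1.0, 2.0, 3.0], [4.0]]
per_episode = sketch_mean_std(rewards, computer_on_episodes=True)
per_step = sketch_mean_std(rewards)
```

With ragged episodes the two modes generally disagree, because short episodes carry the same weight as long ones only in the per-episode mode.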

evalute_algorithms(algorithms, num_episodes=1, error_reward=None, initial_states=None, to_plt=True, plot_dir='./plt_results')[source]

When executing evalute_algorithms, self.normalize should be False.

Parameters
  • algorithms – list of (algorithm, algorithm_name, normalize). Each algorithm must have a method predict(observation) -> action: np.ndarray.

  • num_episodes – number of episodes to run.

  • error_reward – overrides self.error_reward.

  • initial_states – None, the location of a numpy file of initial states, or a (numpy) list of initial states.

  • to_plt – whether to generate plots.

  • plot_dir – None or the directory in which to save plots.

Returns

A list of average_rewards over each episode, and the number of episodes.
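The expected shape of the algorithms argument can be sketched as follows. ConstantPolicy and the algorithm names are hypothetical, and a plain list stands in for the np.ndarray that predict is documented to return.

```python
class ConstantPolicy:
    """Hypothetical algorithm that always returns the same action."""

    def __init__(self, action):
        self.action = list(action)

    def predict(self, observation):
        # A real algorithm would return an np.ndarray; a list stands in here.
        return list(self.action)


# Each entry is (algorithm, algorithm_name, normalize), as documented above.
algorithms = [
    (ConstantPolicy([0.1, 0.2, 0.3, 0.4]), "constant_low", False),
    (ConstantPolicy([0.5, 0.5, 0.5, 0.5]), "constant_mid", False),
]

# Extracting the names from the tuples.
algo_names = [name for _, name, _ in algorithms]
```

Any object with a compatible predict method should fit this tuple layout, whether it is a learned policy or a classical controller.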

plot(show=False, efactor_fig_name=None, input_fig_name=None)[source]
reset(initial_state=None, normalize=None)[source]

Required by gym. This function resets the environment and returns an initial observation.

reward_function_standard(previous_observation, action, current_observation, reward=None)[source]

Computes the reward for one transition: the previous observation, the action taken, and the resulting current observation.

Parameters
  • previous_observation ([np.ndarray]) – The denormalized previous observation.

  • action ([np.ndarray]) – The denormalized action.

  • current_observation ([np.ndarray]) – The denormalized current observation.

  • reward ([float]) – If reward is provided, it is returned directly.

Returns

reward.

Return type

[float]

sample_initial_state(no_sample=False, lower_bound=0.99, upper_bound=1.01)[source]
step(action, normalize=None)[source]

Required by gym. This function performs one step within the environment and returns the observation, the reward, whether the episode is finished, and debug information, if any.

class smpl.envs.atropineenv.AtropineMPC(model_loc='https://github.com/smpl-env/smpl-experiments/blob/main/configdata/atropineenv/model.npy?raw=true', N=30, Nx=2, Nu=4, uss_subtracted=True, reward_on_ess_subtracted=False, reward_on_steady=True, reward_on_absolute_efactor=False, reward_on_actions_penalty=0.0, reward_on_reject_actions=True, relaxed_max_min_actions=False, observation_include_t=True, observation_include_action=False, observation_include_uss=True, observation_include_ess=True, observation_include_e=True, observation_include_kf=True, observation_include_z=True, observation_include_x=False)[source]

Bases: object

predict(state)[source]

smpl.envs.beerfmtenv module

BeerFMT simulates the Beer Fermentation process.

class smpl.envs.beerfmtenv.BeerFMTEnvGym(dense_reward=True, normalize=True, debug_mode=False, action_dim=1, observation_dim=8, reward_function=None, done_calculator=None, max_observations=[15, 15, 15, 150, 150, 10, 10, 200], min_observations=[0, 0, 0, 0, 0, 0, 0, 0], max_actions=[16.0], min_actions=[9.0], observation_name=['X_A', 'X_L', 'X_D', 'S', 'EtOH', 'DY', 'EA', 'time'], action_name=['temperature'], np_dtype=<class 'numpy.float32'>, max_steps=200, error_reward=-200.0)[source]

Bases: smpl.envs.utils.smplEnvBase

done_calculator_standard(current_observation, step_count, reward, update_prev_biomass=False, done=None, done_info=None)[source]
Checks whether the current episode is considered finished.

Returns a boolean indicating whether the episode is done, together with a dictionary of details. In done_calculator_standard, done_info looks like {“terminal”: boolean, “timeout”: boolean}. “timeout” is true when the episode ends because it reached the maximum episode length. “terminal” is true when “timeout” is true or when the episode ends because of a termination condition, such as an environment error, so it is essentially the same as done.

Parameters
  • current_observation ([np.ndarray]) – The denormalized observation.

  • step_count ([int]) – The current step count.

  • reward ([float]) – The reward of the current step.

  • done ([bool], optional) – Defaults to None.

  • done_info ([dict], optional) – How the episode finished. Defaults to None.

Returns

done and done_info.

Return type

[(bool, dict)]

reset(initial_state=None, normalize=None)[source]

Required by gym. This function resets the environment and returns an initial observation.

reward_function_standard(previous_observation, action, current_observation, reward=None)[source]

Computes the reward for one transition: the previous observation, the action taken, and the resulting current observation.

Parameters
  • previous_observation ([np.ndarray]) – The denormalized previous observation.

  • action ([np.ndarray]) – The denormalized action.

  • current_observation ([np.ndarray]) – The denormalized current observation.

  • reward ([float]) – If reward is provided, it is returned directly.

Returns

reward.

Return type

[float]

sample_initial_state()[source]
step(action, normalize=None)[source]

Required by gym. This function performs one step within the environment and returns the observation, the reward, whether the episode is finished, and debug information, if any.

smpl.envs.beerfmtenv.beer_ode(points, t, sets)[source]

Beer fermentation process

smpl.envs.pensimenv module

class smpl.envs.pensimenv.PenSimEnvGym(recipe_combo, dense_reward=True, normalize=True, debug_mode=False, action_dim=6, observation_dim=9, reward_function=None, done_calculator=None, max_observations=[552.0, 16.10523, 725.6828, 13.717274, 540.0, 3600.0002, 1892.07874, 253840.11, 47.898834], min_observations=[0.0, 0.0, 118.98977, 0.0, 0.0, 0.0, 0.0, 25003.258, 0.0], max_actions=[4100.0, 151.0, 36.0, 76.0, 1.2, 510.0], min_actions=[0.0, 7.0, 21.0, 29.0, 0.5, 0.0], observation_name=None, action_name=None, initial_state_deviation_ratio=0.1, np_dtype=<class 'numpy.float32'>, max_steps=1150, error_reward=-100.0, fast=True, random_seed=0, random_seed_max=20000)[source]

Bases: pensimpy.peni_env_setup.PenSimEnv, smpl.envs.utils.smplEnvBase

reset(normalize=None, random_seed_ref=None)[source]

Set up the environment and return the observation class x.

sample_initial_state(random_seed_ref=None)[source]
step(action, normalize=None)[source]

Simulate the fermentation process by solving ODE.

class smpl.envs.pensimenv.PeniControlData(load_just_a_file='', dataset_folder='examples/example_batches', delimiter=', ', observation_dim=9, action_dim=6, normalize=True, np_dtype=<class 'numpy.float32'>)[source]

Bases: object

Dataset helper class that mainly aims to mimic d4rl’s qlearning_dataset format (which returns a dictionary). The data is produced from PenSimPy-generated CSVs.
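For readers unfamiliar with that format, the sketch below builds a dictionary with the keys that d4rl's qlearning_dataset is understood to use (observations, actions, next_observations, rewards, terminals). The helper name sketch_qlearning_dict and the toy trajectory are hypothetical, and plain lists stand in for numpy arrays; this is not the code of PeniControlData.

```python
def sketch_qlearning_dict(observations, actions, rewards):
    """Turn one trajectory into a transition dictionary.

    observations has one more entry than actions, because it also
    contains the observation reached after the last action.
    """
    num_steps = len(actions)
    # Assumption: only the last transition of the trajectory is terminal.
    terminals = [step == num_steps - 1 for step in range(num_steps)]
    return {
        "observations": observations[:-1],
        "actions": actions,
        "next_observations": observations[1:],
        "rewards": rewards,
        "terminals": terminals,
    }


dataset = sketch_qlearning_dict(
    observations=[[0.0], [0.5], [1.0]],
    actions=[[1.0], [1.0]],
    rewards=[-0.5, 0.0],
)
```

Every value in the dictionary has one entry per transition, which is what makes the format convenient for offline Q-learning style training.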

get_dataset()[source]
load_file_list_to_dict(file_list, shuffle=True)[source]
smpl.envs.pensimenv.get_observation_data_reformed(observation, t)[source]

Get observation data at t. The variables are Temperature, Acid flow rate, Base flow rate, Cooling water, Heating water, Vessel Weight, and Dissolved oxygen concentration, respectively, in CSV terms; abbreviations are used here to stay consistent with peni_env_setup.

smpl.envs.reactorenv module

ReactorEnv simulates a general reactor environment. It is intended as a template environment. The documentation in that file is more detailed, and the provided comment lines (# ---- standard ---- and # /---- standard ----) enclose pieces of code that should be reused by most smpl environments. Please consult the smplEnvBase class.

class smpl.envs.reactorenv.ReactorEnvGym(dense_reward=True, normalize=True, debug_mode=False, action_dim=2, observation_dim=3, reward_function=None, done_calculator=None, max_observations=[1.0, 100.0, 1.0], min_observations=[1e-08, 1e-08, 1e-08], max_actions=[35.0, 0.2], min_actions=[15.0, 0.05], error_reward=-1000.0, initial_state_deviation_ratio=0.3, compute_diffs_on_reward=False, np_dtype=<class 'numpy.float32'>, sampling_time=0.1, max_steps=100)[source]

Bases: smpl.envs.utils.smplEnvBase

evaluate_observation(observation)[source]

Parameters

observation – numpy array of shape (self.observation_dim).

Returns

observation evaluation (a reward, in a sense).

evaluate_rewards_mean_std_over_episodes(algorithms, num_episodes=1, error_reward=None, initial_states=None, to_plt=True, plot_dir='./plt_results', computer_on_episodes=False)[source]

Returns the mean and std of rewards over all episodes. Because rewards_list is ragged (some trajectories are shorter than others), it cannot be converted directly to a numpy array; the nested list has to be converted and unwrapped first. If computer_on_episodes is true, the rewards are first averaged within each episode, and the mean and std are then computed from those averages. Otherwise, the mean and std are computed directly over the rewards of every step.

evalute_algorithms(algorithms, num_episodes=1, error_reward=None, initial_states=None, to_plt=True, plot_dir='./plt_results')[source]

When executing evalute_algorithms, self.normalize should be False.

Parameters
  • algorithms – list of (algorithm, algorithm_name, normalize). Each algorithm must have a method predict(observation) -> action: np.ndarray.

  • num_episodes – number of episodes to run.

  • error_reward – overrides self.error_reward.

  • initial_states – None, the location of a numpy file of initial states, or a (numpy) list of initial states.

  • to_plt – whether to generate plots.

  • plot_dir – None or the directory in which to save plots.

Returns

A list of average_rewards over each episode, and the number of episodes.

evenly_spread_initial_states(val_per_state, dump_location=None)[source]

Evenly spread initial states. This function is needed only if the environment has steady_observations.

Parameters

val_per_state (int) – how many values to sample per state.

Returns

[initial_states]: evenly spread initial_states.
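One plausible reading of "evenly spread" is a regular grid with val_per_state values along each state dimension. The sketch below illustrates that reading only; the function name, the explicit low and high bounds, and the grid construction are all assumptions, not the documented behaviour of evenly_spread_initial_states.

```python
import itertools


def sketch_evenly_spread(low, high, val_per_state):
    """Hypothetical grid of initial states between per-dimension bounds."""
    axes = []
    for lo, hi in zip(low, high):
        if val_per_state == 1:
            # A single value per state: use the midpoint of the range.
            axes.append([(lo + hi) / 2])
        else:
            step = (hi - lo) / (val_per_state - 1)
            axes.append([lo + k * step for k in range(val_per_state)])
    # The Cartesian product gives val_per_state ** len(low) initial states.
    return [list(point) for point in itertools.product(*axes)]


states = sketch_evenly_spread([0.0, 10.0], [1.0, 20.0], val_per_state=3)
```

Note that the number of states grows exponentially with the state dimension under this reading, so val_per_state would need to stay small.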

find_outperformances(algorithms, rewards_list, initial_states, threshold=0.05, top_k=10)[source]

Computes the outperformances of the last algorithm in algorithms. There are three criteria:
  • If, in a trajectory, the algorithm has reward >= all other algorithms, the corresponding initial_state is stored in always_better.

  • If, in a trajectory, the algorithm’s mean reward >= threshold + all other algorithms’ mean reward, the corresponding initial_state is stored in averagely_better.

  • For the top_k trajectories with the largest outperformance in mean reward, the corresponding initial_state is stored in top_k_better, in ascending order.
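The three criteria can be sketched in plain Python as follows. This is one interpretation under stated assumptions: rewards[i][j] is taken to be the list of step rewards of algorithm i on trajectory j, "reward >= all other algorithms" is read as a step-by-step comparison, and the function name is hypothetical.

```python
import statistics


def sketch_find_outperformances(rewards, initial_states, threshold=0.05, top_k=10):
    """Compare the last algorithm in rewards against all the others."""
    *others, candidate = rewards
    always_better, averagely_better, margins = [], [], []
    for j, state in enumerate(initial_states):
        cand = candidate[j]
        # Criterion 1: at every step, reward >= every other algorithm's reward.
        if all(c >= o for other in others for c, o in zip(cand, other[j])):
            always_better.append(state)
        # Criterion 2: mean reward beats the best other mean by at least threshold.
        margin = statistics.fmean(cand) - max(
            statistics.fmean(other[j]) for other in others
        )
        if margin >= threshold:
            averagely_better.append(state)
        margins.append((margin, j))
    # Criterion 3: the top_k largest margins, listed in ascending order.
    top_k_better = [initial_states[j] for _, j in sorted(margins)[-top_k:]]
    return always_better, averagely_better, top_k_better


rewards = [
    [[1.0, 1.0], [2.0, 2.0], [0.0, 3.0]],  # baseline algorithm
    [[2.0, 2.0], [1.0, 1.0], [1.0, 2.5]],  # last algorithm (the candidate)
]
always, averagely, top = sketch_find_outperformances(
    rewards, ["s0", "s1", "s2"], threshold=0.05, top_k=2
)
```

In this toy data the candidate dominates on s0, wins only on average on s2, and loses on s1.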

find_outperformances_compute_always_better(rewards)[source]
find_outperformances_compute_average_outperformances(rewards)[source]
generate_dataset_with_algorithm(algorithm, normalize=None, num_episodes=1, error_reward=-1000.0, initial_states=None, format='d4rl')[source]

Creates a dataset for offline reinforcement learning, in either d4rl or pytorch format. The trajectories are generated by the algorithm, which interacts with this env initialized by initial_states.

Parameters

algorithm – an instance that has a method predict(observation) -> action: np.ndarray.

Returns

If format == ‘d4rl’, a dictionary in d4rl format. If format == ‘torch’, an object of type torch.utils.data.Dataset.

reset(initial_state=None, normalize=None)[source]

Required by gym. This function resets the environment and returns an initial observation.

reward_function_standard(previous_observation, action, current_observation, reward=None)[source]

Computes the reward for one transition: the previous observation, the action taken, and the resulting current observation.

Parameters
  • previous_observation ([np.ndarray]) – The denormalized previous observation.

  • action ([np.ndarray]) – The denormalized action.

  • current_observation ([np.ndarray]) – The denormalized current observation.

  • reward ([float]) – If reward is provided, it is returned directly.

Returns

reward.

Return type

[float]

sample_initial_state()[source]
step(action, normalize=None)[source]

Required by gym. This function performs one step within the environment and returns the observation, the reward, whether the episode is finished, and debug information, if any.
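The reset and step contract described above leads to the usual gym interaction loop. The sketch below uses a hypothetical ToyEnv stand-in rather than ReactorEnvGym, so that it runs without the smpl package; its dynamics and reward are invented for illustration, and only the reset/step signatures and the four-value return follow this document.

```python
class ToyEnv:
    """Hypothetical stand-in exposing the documented reset/step signatures."""

    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.step_count = 0
        self.state = 0.0

    def reset(self, initial_state=None, normalize=None):
        self.step_count = 0
        self.state = 0.0 if initial_state is None else initial_state
        return [self.state]

    def step(self, action, normalize=None):
        self.step_count += 1
        self.state += action[0]  # invented dynamics
        reward = -abs(self.state - 1.0)  # invented reward: stay close to 1.0
        done = self.step_count >= self.max_steps
        return [self.state], reward, done, {"timeout": done}


env = ToyEnv()
observation = env.reset()
done = False
rewards = []
while not done:
    action = [0.5]  # a real agent would compute this from the observation
    observation, reward, done, info = env.step(action)
    rewards.append(reward)
```

The same loop should apply to any environment with these signatures, with the constant action replaced by a call such as algorithm.predict(observation).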

class smpl.envs.reactorenv.ReactorMLPReinforceAgent(obs_dim=3, act_dim=2, hidden_size=6, device='cpu')[source]

Bases: torch.nn.modules.module.Module

A simple torch-MLP-based RL agent.

Returns

Either a sampled action or the distribution the action is sampled from.

Return type

[type]

forward(obs, return_distribution=True)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class smpl.envs.reactorenv.ReactorMPC(Nt=20, dt=0.1, Q=array([[0.1, 0., 0.], [0., 0.1, 0.], [0., 0., 0.1]]), R=array([[0., 0.], [0., 0.]]), P=array([[0.1, 0., 0.], [0., 0.1, 0.], [0., 0., 0.1]]))[source]

Bases: object

a simple MPC controller.

Returns

action.

Return type

[type]

Pffunc(x)[source]
build_controller()[source]
lfunc(x, u)[source]
predict(x)[source]
class smpl.envs.reactorenv.ReactorModel(sampling_time)[source]

Bases: object

ode(x, u)[source]
step(x, u)[source]
class smpl.envs.reactorenv.ReactorPID(Kis, steady_state=[0.8778252, 0.659], steady_action=[26.85, 0.1], min_action=[15.0, 0.05], max_action=[35.0, 0.2])[source]

Bases: object

a simple proportional controller that utilizes the config of this environment.

Returns

action.

Return type

[type]

predict(state)[source]
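A proportional controller of this kind can be sketched as a steady-state action plus a gain times the deviation from the steady state, clipped to the action bounds. The sketch is an assumption about the general technique, not ReactorPID's code: the function name, the sign convention, and the one-to-one pairing of state entries with actions are hypothetical. The numeric defaults are taken from the signature above.

```python
def sketch_proportional_action(state, gains, steady_state, steady_action,
                               min_action, max_action):
    """Hypothetical proportional control law with clipping."""
    action = []
    for x, k, x_ss, u_ss, lo, hi in zip(
        state, gains, steady_state, steady_action, min_action, max_action
    ):
        # Assumed sign convention: push the state back toward its steady value.
        u = u_ss + k * (x_ss - x)
        # Clip to the admissible action range.
        action.append(min(max(u, lo), hi))
    return action


config = dict(
    gains=[100.0, 1.0],  # hypothetical gains
    steady_state=[0.8778252, 0.659],
    steady_action=[26.85, 0.1],
    min_action=[15.0, 0.05],
    max_action=[35.0, 0.2],
)
at_steady_state = sketch_proportional_action([0.8778252, 0.659], **config)
far_from_steady = sketch_proportional_action([0.0, 0.659], **config)
```

At the steady state the controller simply returns the steady action, and a large deviation saturates at the action bound.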

smpl.envs.utils module

class smpl.envs.utils.TorchDatasetFromD4RL(dataset_d4rl)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

class smpl.envs.utils.smplEnvBase(dense_reward=True, normalize=True, debug_mode=False, action_dim=2, observation_dim=3, max_observations=[1.0, 1.0], min_observations=[-1.0, -1.0], max_actions=[1.0, 1.0], min_actions=[-1.0, -1.0], observation_name=None, action_name=None, initial_state_deviation_ratio=None, np_dtype=<class 'numpy.float32'>, max_steps=None, error_reward=-100.0)[source]

Bases: gym.core.Env

action_beyond_box(action)[source]

Checks whether the action is beyond the box, which is undesirable.

Parameters

action ([np.ndarray]) – The denormalized action.

Returns

Whether the action is beyond the box.

Return type

[bool]
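A box check of this kind amounts to comparing each entry against its lower and upper bound. The sketch below is a minimal illustration, assuming that values exactly on a bound count as inside the box; the function name is hypothetical, and the bounds are borrowed from the ReactorEnvGym defaults listed in this document.

```python
def sketch_beyond_box(values, min_values, max_values):
    """Return True if any entry lies outside its [min, max] interval."""
    return any(
        v < lo or v > hi for v, lo, hi in zip(values, min_values, max_values)
    )


min_actions = [15.0, 0.05]
max_actions = [35.0, 0.2]
inside = sketch_beyond_box([26.85, 0.1], min_actions, max_actions)
outside = sketch_beyond_box([40.0, 0.1], min_actions, max_actions)
```

The same comparison would apply to observations, with the observation bounds in place of the action bounds.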

algorithms_to_algo_names(algorithms)[source]
Parameters

algorithms – list of (algorithm, algorithm_name, normalize).

Returns

list of algorithm_name.

dataset_to_observations_actions_rewards_list(dataset)[source]

Converts a dataset into lists of observations, actions, and rewards.

Parameters

dataset – d4rl or torch format dataset obtained from generate_dataset_with_algorithm.

Returns

observations_list, actions_list, rewards_list, the same as evalute_algorithms.

done_calculator_standard(current_observation, step_count, reward, done=None, done_info=None)[source]
Checks whether the current episode is considered finished.

Returns a boolean indicating whether the episode is done, together with a dictionary of details. In done_calculator_standard, done_info looks like {“terminal”: boolean, “timeout”: boolean}. “timeout” is true when the episode ends because it reached the maximum episode length. “terminal” is true when “timeout” is true or when the episode ends because of a termination condition, such as an environment error, so it is essentially the same as done.

Parameters
  • current_observation ([np.ndarray]) – The denormalized observation.

  • step_count ([int]) – The current step count.

  • reward ([float]) – The reward of the current step.

  • done ([bool], optional) – Defaults to None.

  • done_info ([dict], optional) – How the episode finished. Defaults to None.

Returns

done and done_info.

Return type

[(bool, dict)]

evalute_algorithms(algorithms, num_episodes=1, error_reward=None, initial_states=None, to_plt=True, plot_dir='./plt_results')[source]

When executing evalute_algorithms, self.normalize should be False.

Parameters
  • algorithms – list of (algorithm, algorithm_name, normalize). Each algorithm must have a method predict(observation) -> action: np.ndarray.

  • num_episodes – number of episodes to run.

  • error_reward – overrides self.error_reward.

  • initial_states – None, the location of a numpy file of initial states, or a (numpy) list of initial states.

  • to_plt – whether to generate plots.

  • plot_dir – None or the directory in which to save plots.

Returns

observations_list, actions_list, rewards_list.

evenly_spread_initial_states(val_per_state, dump_location=None)[source]

Evenly spread initial states. This function is needed only if the environment has steady_observations.

Parameters

val_per_state (int) – how many values to sample per state.

Returns

[initial_states]: evenly spread initial_states.

generate_dataset_with_algorithm(algorithm, normalize=None, num_episodes=1, error_reward=-1000.0, initial_states=None, format='d4rl')[source]

Creates a dataset for offline reinforcement learning, in either d4rl or pytorch format. The trajectories are generated by the algorithm, which interacts with this env initialized by initial_states.

Parameters

algorithm – an instance that has a method predict(observation) -> action: np.ndarray.

Returns

If format == ‘d4rl’, a dictionary in d4rl format. If format == ‘torch’, an object of type torch.utils.data.Dataset.

observation_beyond_box(observation)[source]

Checks whether the observation is beyond the box, which is undesirable.

Parameters

observation ([np.ndarray]) – The denormalized observation.

Returns

Whether the observation is beyond the box.

Return type

[bool]

observation_done_and_reward_calculator(current_observation, action, normalize=None, step_reward=None, done_info=None)[source]

Handles one step of the s, a, r, s, a rollout, with error checks.

Parameters
  • current_observation (list or np.ndarray) – The denormalized current observation.

  • previous_observation (np.ndarray) – The denormalized previous observation.

  • action (np.ndarray) – The denormalized action.

  • normalize (bool) – Defaults to None.

  • step_reward (float, optional) – The reward of the current step. Defaults to None.

  • done_info (dict, optional) – Defaults to None.

Returns
  • observation (np.ndarray) – The observation returned for the step function, normalized or not as controlled by the normalize argument.

  • [(float, bool, dict)] – reward, done and done_info. done_info looks like {“timeout”: boolean, “error_occurred”: boolean, “terminal”: boolean}. “timeout” is true when the episode ends because it reached the maximum episode length. “error_occurred” is true when the episode ends because of an environment error. “terminal” is true when “timeout” is true or when the episode ends because of a termination condition, such as product collection being finished, so it is essentially the same as done. “terminal” should be True whenever timeout or error_occurred is true.

report_rewards(rewards_list, algo_names=None, save_dir=None)[source]

Returns the mean and std of rewards over all episodes. Because rewards_list is ragged (some trajectories are shorter than others), it cannot be converted directly to a numpy array; the nested list has to be converted and unwrapped first. on_episodes first averages the rewards within each episode and then computes the mean and std of those averages. all_rewards computes the mean and std directly over the rewards of every step. rewards_list[i][j][t] is algorithm_i_game_j_reward_t.

reset(initial_state=None, normalize=None)[source]

Required by gym. This function resets the environment and returns an initial observation.

reward_function_standard(previous_observation, action, current_observation, reward=None)[source]

Computes the reward for one transition: the previous observation, the action taken, and the resulting current observation.

Parameters
  • previous_observation ([np.ndarray]) – The denormalized previous observation.

  • action ([np.ndarray]) – The denormalized action.

  • current_observation ([np.ndarray]) – The denormalized current observation.

  • reward ([float]) – If reward is provided, it is returned directly.

Returns

reward.

Return type

[float]

sample_initial_state()[source]
set_initial_states(initial_states, num_episodes)[source]
step(action, normalize=None)[source]

Required by gym. This function performs one step within the environment and returns the observation, the reward, whether the episode is finished, and debug information, if any.

Module contents