3.3. Model design

Once you have aggregated data to build a model and organized this data into a computable form, the next step of the modeling process is to design your model. The goal of model design is to select the most likely model for a system given everything you know about the system. To avoid overfitting, this selection can be done using cross validation, by penalizing larger models so that smaller, more parsimonious models are preferred, or both. However, it is often difficult to formalize all of your prior knowledge and your confidence in that knowledge, and there is often insufficient data for cross validation.
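The two strategies mentioned above, cross validation and penalizing larger models, can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the data are synthetic, the two candidate models (a constant and a straight line) and all function names are hypothetical, and the penalty used is the Gaussian-error form of the Bayesian information criterion (BIC), up to an additive constant.

```python
import math
import random

def fit_constant(xs, ys):
    """Smaller candidate model: predict the mean of y (1 parameter)."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_linear(xs, ys):
    """Larger candidate model: least-squares line (2 parameters)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def rss(predict, xs, ys):
    """Residual sum of squares of a fitted model on the given points."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))

def cv_error(fit, xs, ys, k=5):
    """Mean squared prediction error estimated by k-fold cross validation."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        total += rss(predict, [xs[i] for i in test], [ys[i] for i in test])
    return total / n

def bic(fit, n_params, xs, ys):
    """Penalized fit: n*ln(RSS/n) + n_params*ln(n); lower is better."""
    n = len(xs)
    return n * math.log(rss(fit(xs, ys), xs, ys) / n) + n_params * math.log(n)

# Synthetic data (assumption): a noisy straight line.
rng = random.Random(0)
xs = [float(i) for i in range(20)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

candidates = {"constant": (fit_constant, 1), "linear": (fit_linear, 2)}
cv_scores = {name: cv_error(fit, xs, ys) for name, (fit, _) in candidates.items()}
bic_scores = {name: bic(fit, k, xs, ys) for name, (fit, k) in candidates.items()}

best_by_cv = min(cv_scores, key=cv_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
print("cross validation prefers:", best_by_cv)
print("BIC prefers:", best_by_bic)
```

Both criteria need only the data and a way to fit each candidate, which is why they are attractive when enough data is available. With very few observations, the held-out folds become too small for the cross validation estimate to be reliable.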

Broadly, there are two families of approaches to designing models: data-driven model design and expert-driven model design. Data-driven model design is a formal mathematical approach that tries to identify the most likely model of a system given prior information about that system. Its advantages are that it is rigorous, automated, unbiased, and scalable to large models. However, it requires large amounts of data, typically far more than is available, and it does not leverage heterogeneous prior information effectively. The advantage of expert-driven model design is that it leverages heterogeneous prior information effectively, including information that has not been described formally. However, expert-driven model design can be time-consuming and often leads to models that are biased toward the modeler’s preconceptions.

Because of the limitations of data-driven and expert-driven model design, several groups have developed hybrid model design approaches that automatically generate model design suggestions for modelers to review and accept or reject. For example, Henry et al. developed Model SEED to automatically seed expert-driven flux balance analysis (FBA) metabolism models from models of related organisms, Latendresse et al. developed MetaFlux to seed expert-designed FBA metabolism models from pathway/genome databases (PGDBs), and Kumar et al. developed GapFind to highlight gaps in expert-designed models.
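To make the idea of automatically highlighting gaps more concrete, here is a minimal structural check for dead-end metabolites in a toy reaction network. This is a simplified illustration, not GapFind's actual method, which this sketch does not attempt to reproduce. The network, the metabolite and reaction names, and the function `find_dead_ends` are all hypothetical.

```python
# Toy network (hypothetical): reaction id -> (stoichiometry, reversible).
# Negative coefficients are substrates, positive coefficients are products.
reactions = {
    "uptake_glc": ({"glc": 1}, False),            # exchange reaction supplying glc
    "r1": ({"glc": -1, "g6p": 1}, False),
    "r2": ({"g6p": -1, "f6p": 1}, True),
    "r3": ({"x5p": -1, "f6p": 1}, False),         # nothing in the network makes x5p
    "r4": ({"f6p": -1, "fbp": 1}, False),         # nothing in the network uses fbp
}

def find_dead_ends(reactions):
    """Return (never_produced, never_consumed) metabolite lists.

    A reversible reaction is treated as able to both produce and
    consume every metabolite that participates in it.
    """
    produced, consumed = set(), set()
    for stoichiometry, reversible in reactions.values():
        for metabolite, coefficient in stoichiometry.items():
            if coefficient > 0 or reversible:
                produced.add(metabolite)
            if coefficient < 0 or reversible:
                consumed.add(metabolite)
    metabolites = produced | consumed
    return sorted(metabolites - produced), sorted(metabolites - consumed)

never_produced, never_consumed = find_dead_ends(reactions)
print("never produced:", never_produced)
print("never consumed:", never_consumed)
```

A check like this only produces suggestions. Whether a flagged metabolite reflects a missing reaction, a missing exchange with the environment, or a deliberate simplification is still a decision for the modeler, which is the review step that hybrid approaches leave to the expert.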

Given that we do not yet have sufficient data to learn models, we must scale expert-driven model building to large models by decomposing models into multiple submodels, programmatically building models from structured sources of prior information, and automatically.

3.3.1. Software tools

Below are some of the most commonly used model design tools.

3.3.2. Exercises

3.3.2.1. Required software

3.3.2.2. Expert-driven model design with COPASI

  1. Select a model from BioModels
  2. Use the COPASI GUI to recreate your chosen model

3.3.2.3. PGDB-driven model design with MetaFlux

  1. Download and install Pathway Tools, which contains MetaFlux
  2. Download EcoCyc
  3. Follow the MetaFlux tutorial and use MetaFlux to construct an FBA model of Escherichia coli

3.3.2.4. Formal model selection

See the scikit-learn tutorial on model selection.
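As a warm-up before the tutorial, the following sketch uses scikit-learn's cross validation utilities to choose among candidate models of increasing size. It is only an illustration: the synthetic quadratic data, the range of polynomial degrees, and the variable names are assumptions made for this example, and it requires NumPy and scikit-learn to be installed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a noisy quadratic curve.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60).reshape(-1, 1)
y = 1.0 - 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=60)

# Shuffled 5-fold split, fixed seed so the comparison is repeatable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # scikit-learn reports negated MSE so that larger is always better;
    # flip the sign back to get a mean squared error per candidate.
    fold_scores = cross_val_score(model, x, y, cv=cv,
                                  scoring="neg_mean_squared_error")
    scores[degree] = -fold_scores.mean()

best_degree = min(scores, key=scores.get)
for degree, mse in scores.items():
    print(f"degree {degree}: cross-validated MSE = {mse:.3f}")
print("selected degree:", best_degree)
```

The straight-line model should score clearly worse than the quadratic one because it cannot capture the curvature. Higher degrees fit the training folds more closely but usually gain little or nothing on the held-out folds.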

3.3.2.5. Bayesian network structure learning

See the pgmpy tutorial on learning Bayesian networks.
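Before working through the tutorial, it may help to see the core idea behind score-based structure learning in isolation: each candidate network structure is scored by how well it explains the data, minus a penalty for its number of parameters. The sketch below uses only the Python standard library and does not use or imitate the pgmpy API. The two binary variables, the synthetic data, and the function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def log_likelihood(data, child, parents):
    """Maximized log-likelihood of P(child | parents), estimated from counts."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    return sum(count * math.log(count / parent_counts[config])
               for (config, _), count in joint.items())

def bic_score(data, structure):
    """BIC of a structure given as {variable: tuple_of_parents}; higher is better.

    All variables are assumed binary, so each variable contributes one
    free parameter per configuration of its parents.
    """
    n = len(data)
    fit = sum(log_likelihood(data, var, parents)
              for var, parents in structure.items())
    n_params = sum(2 ** len(parents) for parents in structure.values())
    return fit - 0.5 * n_params * math.log(n)

# Synthetic data (assumption): B copies A 90% of the time.
rng = random.Random(0)
data = []
for _ in range(200):
    a = int(rng.random() < 0.5)
    b = a if rng.random() < 0.9 else 1 - a
    data.append({"A": a, "B": b})

structures = {
    "independent": {"A": (), "B": ()},
    "A->B": {"A": (), "B": ("A",)},
}
scores = {name: bic_score(data, s) for name, s in structures.items()}
best = max(scores, key=scores.get)
print(scores)
print("preferred structure:", best)
```

Here only two candidate structures are compared by hand. Structure learning algorithms apply the same kind of score while searching over many possible graphs, which is where a dedicated library becomes necessary.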