1. Introduction

A central goal of biological science is to quantitatively understand how genotype influences phenotype. However, despite decades of research, a growing wealth of experimental data, and extensive knowledge of individual molecules and individual pathways, we still do not understand how biological behavior emerges from the molecular level. For example, we do not understand how transcription factors, non-coding RNA, localization signals, degradation tags, and other regulatory systems interact to control protein expression.

Consequently, physicians still cannot interpret the pathophysiological consequences of genetic variation and bioengineers still cannot rationally design microorganisms. Instead, patients often have to try multiple drugs to find a single effective drug, which exposes patients to unnecessary drugs, prolongs disease, and increases costs. Similarly, bioengineers often have to rely on time-consuming and expensive trial and error methods such as directed evolution [haseltine2007synthetic][cobb2012directed].

Many engineering fields use mechanistic models to help understand and design complex systems such as cars [karnopp2012system], buildings [clarke2001energy], and transportation networks [cascetta2009transportation]. In particular, mechanistic models can help researchers conduct experiments with complete control, precision, and reproducibility.

To comprehensively understand cells, we must develop whole-cell (WC) computational models that predict cellular behavior by representing all of the biochemical activity inside cells [karr2015principles][macklin2014future][tomita2001whole][carrera2015build]. WC models could accelerate biological science by helping researchers unify our knowledge of cell biology, identify gaps in our understanding, and conduct complex experiments that would be infeasible in vitro. WC models could also help bioengineers design microorganisms and help physicians personalize medicine.

Since the 1950s, researchers have been using modeling to understand cells. This has led to numerous models of individual pathways, including models of cell cycle regulation, chemical and electrical signaling, circadian rhythms, metabolism, and transcriptional regulation. Collectively, these efforts have used a wide range of mathematical formalisms including Boolean networks, flux balance analysis (FBA) [orth2010flux][bordbar2014constraint][feist2008growing], ordinary differential equations (ODEs), partial differential equations (PDEs), and stochastic simulation [szigeti2018blueprint].

Over the last 20 years, researchers have begun to build more comprehensive models that represent multiple pathways [tomita1999cell][covert2004integrating][chandrasekaran2010probabilistic][covert2008integrating][lee2008dynamic][carrera2014integrative][thiele2009genome]. Many of these models have been built by combining multiple mathematically dissimilar submodels of individual pathways into a single multi-algorithmic model [gonccalves2013bridging][takahashi2004multi].

Although we do not yet have all of the data and methods needed to model entire cells, we believe that WC models are rapidly becoming feasible due to ongoing advances in measurement and computational technology. In particular, we now have a wide array of experimental methods for characterizing cells, numerous repositories that contain much of the data needed for WC modeling, and a variety of tools for extrapolating experimental data to other organisms and conditions. In addition, we now have a wide range of modeling and simulation tools, including tools for designing models, rule-based model formats for describing complex models, and tools for simulating multi-algorithmic models. However, few of these resources support the scale required for WC modeling, and many of these resources remain siloed.

Nevertheless, we and others are beginning to model entire cells [tomita1999cell][atlas2008incorporating][roberts2009long][karr2012whole][bordbar2015personalized]. In 2012, we and others reported the first dynamical model that represents all of the characterized genes in a cell [karr2012whole]. The model represents 28 pathways of the small bacterium Mycoplasma genitalium and predicts the essentiality of its genes with 80% accuracy.

However, several bottlenecks to building more comprehensive and more accurate WC models remain. In particular, we do not yet have all of the data needed for WC modeling or tools for designing, describing, or simulating WC models. To accelerate WC modeling, we must develop new methods for characterizing the single-cell dynamics of each metabolite and protein; develop new methods for scalably designing, simulating, and calibrating high-dimensional dynamical models; develop new standards for describing and verifying dynamical models; and assemble an interdisciplinary WC modeling community.

In this part, we summarize the scientific, engineering, and medical problems which are motivating WC modeling; propose the phenotypes that WC models should aim to predict and the molecular mechanisms that WC models should aim to represent; outline the fundamental challenges of WC modeling; describe why WC models are feasible by reviewing the existing methods, data, and models which could be leveraged for WC modeling; review the latest WC models and their limitations; outline the most immediate bottlenecks to WC modeling; propose a plan for achieving WC models; and summarize ongoing efforts to advance WC modeling.

1.1. Motivation for WC modeling

In our opinion, WC modeling is motivated by the need to understand biology, personalize medicine, and design microorganisms. Biological science needs comprehensive models that represent the sequence, function, and interactions of each gene to help scientists holistically understand cell biology. Similarly, precision medicine needs comprehensive models that predict phenotype from genotype to help physicians interpret the pathophysiological impact of genetic variation, which can occur in any gene, and synthetic biology requires comprehensive models to help bioengineers rationally design microbial genomes for a wide range of applications.

In addition, WC models could help researchers address specific scientific problems such as determining how transcriptional regulation, non-coding RNA, and other pathways combine to regulate protein expression. Furthermore, each WC model could be used to address multiple questions, avoiding the need to build separate models for each question. However, few scientific problems require WC models, and we believe that most scientific problems would be more easily addressed with focused modeling.

Here, we describe the main applications which are motivating WC modeling. In the following sections, we define the biology that WC models must represent to support these applications and describe how to achieve such WC models.

1.1.1. Biological science: understand how genotype influences phenotype

Historically, the main motivation for WC modeling has been to help scientists understand how genotype and the environment determine phenotype, including how each individual gene, reaction, and pathway contributes to cellular behavior. For example, WC models could help researchers integrate heterogeneous experimental data about multiple genes and pathways. WC models could also help researchers gain novel insights into how pathways interact to control behavior. By comparison to experimental data, WC models could also help researchers identify gaps in our understanding. In addition, WC models would enable researchers to conduct experiments with complete control, infinite scope, and unlimited resolution, which would allow researchers to conduct complex experiments that would be infeasible in vitro.

1.1.2. Medicine: personalize medicine for individual genomes

Recent studies have shown that each patient has a unique genome, that genetic variation can occur in any gene and pathway, and that small genetic differences can cause patients to respond differentially to the same drugs. Together, this suggests that medicine could be improved by tailoring therapy to each patient’s genome. Physicians are beginning to use data-driven models to tailor medicine to a small number of well-established genetic variants that have large phenotypic effects. Tailoring medicine for all genetic variation requires WC models that represent every gene and that can predict the phenotypic effect of any combination of genetic variation. Such WC models would help physicians predict the most likely prognosis for each patient and identify the best combination of drugs for each patient (Figure 1.1). For example, WC models could help oncologists conduct personalized in silico drug trials to identify the best chemotherapy regimen for each patient. Similarly, WC models could help obstetricians identify diseases in early fetuses. In addition, WC models could help pharmacologists avoid harmful gene-drug interactions.

Figure 1.1 WC models could transform medicine by helping physicians use patient-specific models informed by genomic data to design personalized prognoses and therapies.

1.1.3. Synthetic biology: rationally design microbial genomes

Synthetic biology promises to create microorganisms for a wide range of industrial, medical, and security applications such as cheaply producing chemicals, drugs, and fuels; quickly detecting diseased tissue; killing pathogenic bacteria; and decontaminating industrial waste. Currently, microorganisms are often engineered using directed evolution [haseltine2007synthetic][cobb2012directed]. However, directed evolution is often time-consuming and limited to small phenotypic changes. Recently, researchers at the JCVI have begun to pioneer methods for chemically synthesizing entire genomes [gibson2010creation]. Realizing the full potential of this methodology requires WC models that can help bioengineers design entire genomes. For example, WC models could help bioengineers analyze the impact of synthetic circuits on host cells, design efficient chassis for synthetic circuits, and design bacterial drug delivery systems that can detect diseased tissue and synthesize drugs in situ.

1.2. The biology that WC models should aim to represent and predict

In the previous section, we argued that medicine and bioengineering need comprehensive models that can predict phenotype from genotype. Here, we outline the specific phenotypes that we believe that WC models should aim to predict and the specific physicochemical mechanisms that we believe that WC models should aim to represent to support medicine and bioengineering (Figure 1.2). In the following sections, we outline why we believe that WC models are becoming feasible and describe how to build and simulate WC models.

Figure 1.2 The physical and chemical mechanisms that WC models should aim to represent (a) and the phenotypes that WC models should aim to predict (b).

1.2.1. Phenotypes that WC models should aim to predict

To support medicine and bioengineering, we believe that WC models should aim to predict the phenotypes of individual cells over their entire life cycles (Figure 1.2b). Specifically, we believe that WC models should aim to predict the following five levels of phenotypes:

  • Stochastic dynamics: To help physicians understand how genetic variation affects how cells respond to drugs, and to help bioengineers design microorganisms that are robust to stochastic variation, WC models should predict the stochastic behavior of each molecular species and molecular interaction. For example, this would help physicians design drugs that are robust to variation in RNA splicing, protein modification, and protein complexation. This would also help bioengineers design feedback loops that can control the expression of key RNA and proteins.
  • Temporal dynamics: To help physicians understand the impact of genetic variation on cell cycle regulation, and to help bioengineers control the temporal dynamics of microorganisms, WC models should predict the temporal dynamics of the concentration of each molecular species. For example, this would help physicians identify genetic variation that can disrupt cell cycle regulation and cause cancer. This would also help bioengineers design microorganisms that can perform specific tasks at specific times.
  • Spatial dynamics: To help physicians predict the intracellular distribution of drugs, and to help bioengineers use space to concentrate and insulate molecular interactions, WC models should predict the concentration of each molecular species in each spatial domain. For example, this would help physicians predict whether drugs interact with their intended targets and predict how quickly cells metabolize drugs. This would also help bioengineers maximize the metabolic activity of microorganisms by co-localizing enzymes and their substrates.
  • Single-cell variation: To help physicians understand how drugs affect populations of heterogeneous cells, and to help bioengineers design robust microorganisms, WC models should predict the single-cell variation of cellular behavior. For example, this would help physicians understand how chemotherapies affect heterogeneous tumors, and help bioengineers design reliable biosensors that activate at the same threshold irrespective of stochastic variation in RNA and protein expression.
  • Complex phenotypes: To help physicians understand the impact of genetic variation on complex phenotypes, and to help bioengineers design microorganisms that exhibit complex phenotypes, WC models should predict complex phenotypes such as cell shape, growth rate, and fate. For example, this would help physicians identify the primary variants responsible for disease and help physicians screen drugs in silico. This would also help bioengineers design sophisticated strains that can detect tumors, synthesize chemotherapeutics, and deliver drugs directly to tumors.
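
To make the stochastic dynamics and single-cell variation levels concrete, the following minimal sketch simulates a single RNA species with constant synthesis and first-order degradation using Gillespie's stochastic simulation algorithm. It is an illustrative toy, not a WC model: the function name, the rate constants, and the number of simulated cells are hypothetical values chosen for the example.

```python
import random

def simulate_rna_copy_number(k_syn, k_deg, t_end, rng):
    """Gillespie simulation of one cell: 0 -> RNA (k_syn), RNA -> 0 (k_deg * n).

    Returns the RNA copy number at time t_end, starting from zero copies.
    """
    t, n = 0.0, 0
    while True:
        a_syn = k_syn            # propensity of synthesis
        a_deg = k_deg * n        # propensity of degradation
        a_total = a_syn + a_deg
        # The waiting time to the next reaction is exponential with rate a_total
        t += rng.expovariate(a_total)
        if t > t_end:
            return n
        # Pick the reaction that fires in proportion to its propensity
        if rng.random() * a_total < a_syn:
            n += 1
        else:
            n -= 1

rng = random.Random(0)          # fixed seed so the example is reproducible
k_syn, k_deg = 10.0, 1.0        # hypothetical rates (molecules/min, 1/min)
counts = [
    simulate_rna_copy_number(k_syn, k_deg, t_end=20.0, rng=rng)
    for _ in range(500)         # 500 independent "cells"
]

mean = sum(counts) / len(counts)
variance = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
print(f"mean = {mean:.1f}, variance = {variance:.1f}")
```

For this birth-death process the stationary distribution is Poisson with mean k_syn / k_deg, so the mean and the variance across the simulated cells should both be close to 10. The spread across cells is the kind of single-cell variation that a WC model would need to predict for every molecular species.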

1.2.2. Physics and chemistry that WC models should aim to represent

To predict these phenotypes, we believe that WC models should aim to represent all of the chemical reactions inside cells and all of the physical processes that influence their rates (Figure 1.2a). Specifically, we propose that WC models aim to represent the following seven aspects of cells:

  • Sequences: To predict how genotype influences phenotype, including the contribution of each individual variant and gene, WC models should represent the sequence of each chromosome, RNA, and protein; the location of each feature of each chromosome such as genes, operons, promoters, and terminators; and the location of each site of each RNA and protein.
  • Structures: To predict how molecular species interact and react, WC models should represent the structure of each molecule, including atom-level information about small molecules, the domains and sites of macromolecules, and the subunit composition of complexes. For example, this would enable WC models to predict the metabolism of novel compounds.
  • Subcellular organization: To capture the molecular interactions that occur inside cells, WC models should represent the spatial organization of cells and the localization of each metabolite, RNA, and protein species. For example, this would enable WC models to predict the spatial compartments in which molecular interactions occur.
  • Concentrations: To capture the molecular interactions that can occur inside cells, WC models should also represent the concentration of each molecular species in each organelle and spatial domain.
  • Molecular interactions: To capture how cells behave over time, WC models should represent the participants and effect of each molecular interaction, including the molecules that are consumed, produced, and transported, the molecular sites that are modified, and the bonds that are broken and formed. For example, this would enable WC models to capture the reactions responsible for cellular growth and homeostatic maintenance.
  • Kinetic parameters: To predict the temporal dynamics of cell behavior, WC models should represent the kinetic parameters of each interaction such as the maximum rate of each reaction and the affinity of each enzyme for its substrates and inhibitors. For example, this would enable WC models to predict the impact of genetic variation on the function of each enzyme.
  • Extracellular environment: To predict how the extracellular environment, including nutrients, hormones, and drugs, influences cell behavior, WC models should represent the concentration of each species in the extracellular environment. For example, this should enable WC models to predict the minimum media required for growth.
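
As an illustration of the kinetic parameters item above, the sketch below evaluates a Michaelis-Menten rate law with competitive inhibition, which is one common (but by no means the only) way to express a maximum rate and the affinities of an enzyme for its substrate and inhibitor. The choice of rate law, the function name, and all parameter values are assumptions made for this example.

```python
def michaelis_menten_rate(v_max, k_m, substrate, inhibitor=0.0, k_i=float("inf")):
    """Reaction rate under Michaelis-Menten kinetics with competitive inhibition.

    v = v_max * S / (k_m * (1 + I / k_i) + S)

    v_max: maximum rate; k_m: substrate affinity constant;
    k_i: inhibitor affinity constant (infinite means no inhibition).
    """
    apparent_k_m = k_m * (1.0 + inhibitor / k_i)
    return v_max * substrate / (apparent_k_m + substrate)

# Hypothetical wild-type enzyme and a variant with ten-fold weaker substrate affinity
wild_type = michaelis_menten_rate(v_max=100.0, k_m=0.5, substrate=0.5)
variant = michaelis_menten_rate(v_max=100.0, k_m=5.0, substrate=0.5)
inhibited = michaelis_menten_rate(v_max=100.0, k_m=0.5, substrate=0.5,
                                  inhibitor=1.0, k_i=1.0)
print(wild_type, variant, inhibited)
```

In this toy setting, a genetic variant is represented simply as a change in k_m, and a drug as a competitive inhibitor; a WC model would need such parameters for every reaction.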

1.3. Fundamental challenges to WC modeling

In the previous section, we defined the biology that WC models should represent and predict. Building WC models that represent all of the biochemical activity inside cells and that can predict any cellular phenotype is challenging because this requires integrating molecular behavior to the cellular level across several spatial and temporal scales; assembling a complete molecular understanding of cell biology from incomplete, imprecise, and heterogeneous data; and simulating, calibrating, and validating computationally-expensive, high-dimensional models. Here, we describe these challenges to WC modeling. In the following sections, we describe emerging methods for overcoming these challenges to achieve WC models.

1.3.1. Integrating molecular behavior to the cell level over several spatiotemporal scales

The most fundamental challenge to WC modeling is integrating the behavior of individual species and reactions to the cellular level over several spatial and temporal scales. This is challenging because it requires accurate parameter values and scalable methods for simulating large models. Here, we summarize these challenges.

1.3.1.1. Sensitivity of phenotypic predictions to molecular parameter values

The first challenge to integrating molecular behavior to the cellular level is the sensitivity of model predictions to the values of critical parameters, which necessitates accurate parameter values. Accurately identifying these values is challenging because, as described below, it is difficult to optimize high-dimensional functions and because, as described in Section 1.3.2, our experimental data is incomplete and imprecise.

1.3.1.2. High computational cost of simulating large fine-grained models

A second challenge to integrating molecular behavior to the cellular level is the high computational cost of simulating entire cells with molecular granularity. For example, simulating one cell cycle of our first WC model of the smallest known freely living organism took a full core-day of an Intel E5520 CPU, or approximately \(1 \times 10^{15}\) floating-point operations [karr2012whole]. Based on this data, the fact that human cells are approximately \(10^6\) times larger, and the fact that a typical WC simulation experiment will require at least 1,000 simulation runs, a typical WC simulation experiment of a human cell will require approximately \(10^6\) core-years. To simulate larger and more complex organisms, we must develop faster parallel simulators.
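
The estimate above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in this section (one core-day per simulated cell cycle, a size factor of a million, and 1,000 runs per experiment). It assumes, as a simplification of ours, that simulation cost grows linearly with cell size; the constant and function names are also ours.

```python
# Back-of-the-envelope cost of one human WC simulation experiment.
# Assumption (ours): cost scales linearly with cell size.

CORE_DAYS_PER_RUN_MGEN = 1.0   # one M. genitalium cell cycle [karr2012whole]
HUMAN_SCALE_FACTOR = 1e6       # human cells are roughly a million times larger
RUNS_PER_EXPERIMENT = 1_000    # minimum number of runs per experiment
DAYS_PER_YEAR = 365.25

def human_experiment_core_years(
    core_days_per_run=CORE_DAYS_PER_RUN_MGEN,
    scale=HUMAN_SCALE_FACTOR,
    runs=RUNS_PER_EXPERIMENT,
):
    """Return the estimated cost of one simulation experiment in core-years."""
    core_days = core_days_per_run * scale * runs
    return core_days / DAYS_PER_YEAR

estimate = human_experiment_core_years()
print(f"{estimate:.2e} core-years")
```

Under these assumptions the result is roughly 2.7 million core-years, consistent with the order-of-magnitude figure quoted above.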

1.3.2. Assembling a unified molecular understanding of cells from imperfect data

In our opinion, the greatest challenge to WC modeling is assembling a unified molecular understanding of cell biology. As illustrated in Figure 1.3, this requires assembling comprehensive data about every molecular species and molecular interaction. For example, to model M. genitalium we reconstructed (a) its subcellular organization; (b) its chromosome sequence; (c) the location, length, direction and essentiality of each gene; (d) the organization and promoter of each transcription unit; (e) the expression and degradation rate of each RNA transcript; (f) the specific folding and maturation pathway of each RNA and protein species including the localization, N-terminal cleavage, signal sequence, prosthetic groups, disulfide bonds and chaperone interactions of each protein species; (g) the subunit composition of each macromolecular complex; (h) its genetic code; (i) the binding sites and footprint of every DNA-binding protein; (j) the structure, charge and hydrophobicity of every metabolite; (k) the stoichiometry, catalysis, coenzymes, energetics and kinetics of every chemical reaction; (l) the regulatory role of each transcription factor; (m) its chemical composition and (n) the composition of its growth medium [karr2013wholecellkb].

Figure 1.3 WC models require comprehensive data about every molecular species and molecular interaction.

This is challenging because our data is incomplete, imprecise, heterogeneous, scattered, and poorly annotated. Here, we summarize these limitations and the challenges they present for WC modeling.

1.3.2.1. Incomplete data

The biggest limitation of our experimental data is that we do not have a complete experimental characterization of a cell. In particular, we have limited genome-scale data about individual metabolites and proteins, limited data about cell cycle dynamics, limited data about cell-to-cell variation, limited data about culture media, and limited data about cellular responses to genetic and environmental perturbations. Many genome-scale datasets are also incomplete. For example, most metabolomics and proteomics methods can only measure small numbers of metabolites and proteins.

1.3.2.2. Imprecise and noisy data

A second limitation of our experimental data is that many of our measurement methods are imprecise and noisy. For example, fluorescence microscopy cannot precisely quantitate single-cell protein abundances, single-cell RNA sequencing cannot reliably discern unexpressed RNA, and mass spectrometry cannot reliably discern unexpressed proteins.

1.3.2.3. Heterogeneous experimental methods

A third limitation of our experimental data is that our data is highly heterogeneous because we do not have a single experimental technology that is capable of completely characterizing a cell. Rather, we have a wide range of methods for characterizing different aspects of cells at different scales with different levels of resolution. For example, mass spectrometry can quantitate the concentrations of tens of metabolites, deep sequencing can quantitate the concentrations of tens of thousands of RNA, and each biochemical experiment can quantitate one or a few kinetic parameters.

Consequently, our experimental data also spans a wide range of scales and units. For example, we have extensive molecular information about the participants in each metabolic reaction and their stoichiometries, but we only have limited information about the substrates of each protein chaperone. As a second example, we have extensive single-cell information about RNA expression, but we have limited single-cell data about metabolite concentrations.

1.3.2.4. Heterogeneous organisms and environmental conditions

A fourth limitation of our data is that we only have a small amount of data about each organism and environmental condition, and only a small amount of data from each laboratory. However, collectively, we have a large amount of data.

1.3.2.5. Siloed data

Another limitation of our data is that no resource contains all of the data needed for WC modeling. Rather, our data is scattered across a wide range of databases, websites, textbooks, publications, supplementary materials, and other resources. For example, ArrayExpress [kolesnikov2015arrayexpress] and the Gene Expression Omnibus [clough2016gene] (GEO) only contain RNA abundance data, PaxDb only contains protein abundance data [wang2015version], and SABIO-RK only contains kinetic data [wittig2012sabio]. Furthermore, many of these data sources use different identifiers and different units.

1.3.2.6. Insufficient annotation

Furthermore, much of our data is insufficiently annotated to understand its biological semantic meaning and provenance. For example, few RNA-seq datasets in ArrayExpress [kolesnikov2015arrayexpress] have sufficient metadata to understand the environmental condition that was measured, including the concentration of each metabolite in the growth media and the temperature and pH of the growth media. Similarly, few kinetic measurements in SABIO-RK [wittig2012sabio] have sufficient metadata to understand the strain that was measured.

1.3.3. Selecting, calibrating and validating high-dimensional models

A third fundamental challenge to WC modeling is the high dimensionality of WC models, which makes WC models susceptible to the “curse of dimensionality”, the need for more data to constrain high-dimensional models [keogh2011curse]. In particular, the curse of dimensionality makes it challenging to select, calibrate, and validate WC models because we do not yet have sufficient data to select among multiple possible WC models, avoid overfitting WC models, precisely determine the value of each parameter, or test the accuracy of every possible prediction. Furthermore, it is computationally expensive to select, calibrate, and validate high-dimensional models.
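
One simple way to see the problem is to count how many model evaluations a naive grid search over parameter space would need. The sketch below assumes, purely for illustration, that each parameter is sampled at a fixed number of values; the function name and the numbers are hypothetical.

```python
def grid_search_evaluations(num_parameters, values_per_parameter=10):
    """Number of model evaluations needed to cover a full factorial grid."""
    return values_per_parameter ** num_parameters

for num_parameters in (1, 2, 5, 10, 100):
    n = grid_search_evaluations(num_parameters)
    print(f"{num_parameters:>4} parameters -> {n:.1e} evaluations")
```

The count grows exponentially with the number of parameters, so exhaustive calibration quickly becomes infeasible for high-dimensional models, especially when each evaluation is itself an expensive simulation.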

1.4. Feasibility of WC models

Despite the numerous challenges to WC modeling described in the previous section, we believe that WC modeling is rapidly becoming feasible due to ongoing technological advances throughout computational systems biology, bioinformatics, genomics, molecular cell biology, applied mathematics, computer science, and software engineering, including methods for experimentally characterizing cells, repositories for sharing data, tools for building and simulating dynamical models, models of individual pathways, and model repositories. While substantial work remains to adapt and integrate these technologies into a unified framework for WC modeling, these technologies are already forming a strong intellectual foundation for WC modeling. Here, we review the technologies that we believe are making WC modeling feasible, and describe their present limitations for WC modeling. In the following section, we describe how we are beginning to leverage these technologies to build and simulate WC models.

1.4.1. Experimental methods, data, and repositories

Here, we review advances in measurement methods, data repositories, and bioinformatics tools that are generating the data needed for WC modeling, aggregating this data into repositories, and producing tools for extrapolating data to other genotypes and environments.

1.4.1.1. Measurement methods

Advances in biochemical, genomic, and single-cell measurement are rapidly generating the data needed for WC modeling [macaulay2017single][altelaar2013next][fuhrer2015high] (Table 1.1). For example, Meth-Seq can assess epigenetic modifications [laird2010principles], Hi-C can determine the average structure of chromosomes [dekker2013exploring], ChIP-seq can determine protein-DNA interactions [park2009chip], fluorescence microscopy can determine protein localizations, mass spectrometry can quantitate average metabolite concentrations, scRNA-seq can quantitate the single-cell variation of each RNA [saliba2014single][kolodziejczyk2015technology], FISH [lee2014highly] can quantitate the spatiotemporal dynamics and single-cell variation of the abundances of a few RNA, mass spectrometry can quantitate the average abundances of hundreds of proteins [dettmer2007mass][bantscheff2012quantitative], mass cytometry can quantitate the single-cell variation of the abundances of tens of proteins [bendall2012deep], and fluorescence microscopy can quantitate the spatiotemporal dynamics and single-cell variation of the abundances of a few proteins. However, improved methods are still needed to measure the dynamics of the entire metabolome and proteome.

Table 1.1 Types of experimental data that can be used to build, calibrate, and validate WC models.
Data type | URL | Reference
Metabolites | |
  Structure | |
    Mass spectrometry | http://doi.org/10.1002/mas.20108 | Dettmer et al., 2007
  Concentration | |
    Fluorescence microscopy | http://doi.org/10.1126/science.1243259 | Zenobi, 2013
    Mass spectrometry | http://doi.org/10.1002/mas.20108 | Dettmer et al., 2007
    Spectrophotometry | http://doi.org/10.1016/B978-0-12-416618-9.00005-4 | TeSlaa and Teitell, 2014
DNA | |
  Structure | |
    DNA sequencing | http://doi.org/10.1038/nbt1486 | Shendure and Ji, 2008
    Methylation sequencing | http://doi.org/10.1038/nrg2732 | Laird, 2010
    Chromosome conformation capture | http://doi.org/10.1038/nrg3454 | Dekker et al., 2013
  Concentration | |
    Flow cytometry | http://doi.org/10.1016/j.it.2012.02.010 | Bendall et al., 2012
RNA | |
  Structure | |
    RNA sequencing | http://doi.org/10.1038/nrg2934 | Ozsolak and Milos, 2011
    Modification sequencing (ICE, MeRIP-seq) | http://doi.org/10.1016/j.trsl.2014.04.003 | Liu and Pan, 2015
    X-ray crystallography | http://doi.org/10.1016/S0076-6879(09)69006-6 | Reyes et al., 2009
  Localization | |
    Fluorescence in situ hybridization | http://doi.org/10.1126/science.1250212 | Lee et al., 2014
  Transcription rate | |
    ChIP-seq | http://doi.org/10.1038/nrg2641 | Park, 2009
    GRO-seq | http://doi.org/10.1126/science.1162228 | Core et al., 2008
  Half-life | |
    Microarray timecourse | http://doi.org/10.1101/gr.912603 | Selinger et al., 2003
    RNA sequencing timecourse | http://doi.org/10.1038/nature10098 | Schwanhäusser et al., 2011
  Concentration | |
    Microarray | http://doi.org/10.1038/35087138 | Schulze and Downward, 2001
    RNA sequencing | http://doi.org/10.1038/nrg2934 | Ozsolak and Milos, 2011
    Fluorescence in situ hybridization | http://doi.org/10.1126/science.1188308 | Taniguchi et al., 2010
Proteins | |
  Structure | |
    Mass spectrometry | http://doi.org/10.1126/science.1124619 | Domon and Aebersold, 2006
    Nuclear magnetic resonance spectroscopy | http://doi.org/10.1146/annurev.biochem.73.011303.074004 | Tugarinov et al., 2004
    RNA sequencing | http://doi.org/10.1038/nrg2934 | Ozsolak and Milos, 2011
    X-ray crystallography | http://doi.org/10.1007/978-1-60327-159-2_3 | Ilari and Savino, 2008
  Localization | |
    Fluorescence microscopy | http://doi.org/10.1126/science.1124618 | Giepmans et al., 2006
  Translation rate | |
    Ribosome profiling | http://doi.org/10.1038/nrg3645 | Ingolia, 2014
  Half-life | |
    Fluorescence timecourse | http://doi.org/10.1098/rsob.140002 | Knop and Edgar, 2014
    Mass spectrometry timecourse | http://doi.org/10.1038/nature10098 | Schwanhäusser et al., 2011
  Concentration | |
    Flow cytometry | http://doi.org/10.1016/j.it.2012.02.010 | Bendall et al., 2012
    Fluorescence microscopy | http://doi.org/10.1126/science.1124618 | Giepmans et al., 2006
    Mass cytometry | http://doi.org/10.1016/j.it.2012.02.010 | Bendall et al., 2012
    Mass spectrometry | http://doi.org/10.1126/science.1124619 | Domon and Aebersold, 2006
    Spectrophotometry | http://doi.org/10.1016/S0076-6879(09)63008-1 | Noble and Bailey, 2009
Interactions | |
  RNA-DNA | |
    ChIRP-seq | http://doi.org/10.1016/j.molcel.2011.08.027 | Chu et al., 2011
  Protein-metabolite | |
    Mass spectrometry | http://doi.org/10.1126/science.1124619 | Domon and Aebersold, 2006
  Protein-DNA | |
    ChIP-seq | http://doi.org/10.1038/nrg2641 | Park, 2009
    DNase-seq | http://doi.org/10.1101/pdb.prot5384 | Song and Crawford, 2010
  Protein-RNA | |
    CLIP-seq | http://doi.org/10.1002/wrna.31 | Darnell, 2010
    RIP-seq | http://doi.org/10.1016/j.molcel.2010.12.011 | Zhao et al., 2010
  Protein-protein | |
    Co-immunoprecipitation | http://doi.org/10.1101/pdb.prot3898 | Sambrook and Russell, 2006
    Tandem affinity purification | http://doi.org/10.1016/j.pep.2010.04.009 | Xu et al., 2010
    Two-hybrid screen | http://doi.org/10.3390/ijms10062763 | Brückner et al., 2009
Reaction fluxes | |
  Isotopic labeling | http://doi.org/10.1002/wsbm.1167 | Klein and Heinzle, 2012
Phenotypic data | |
  Cell size | |
    Fluorescence microscopy | http://doi.org/10.1146/annurev.cellbio.042308.113408 | Muzzey and van Oudenaarden, 2009
  Growth rates | |
    Spectrophotometry | http://doi.org/10.1177/2211068214555414 | Jensen et al., 2015
  Division times | |
    Fluorescence microscopy | http://doi.org/10.1002/cyto.a.20812 | Wang et al., 2010
  Motility, chemotaxis | |
    Fluorescence microscopy | http://doi.org/10.1038/sj.emboj.7601227 | Dormann and Weijer, 2006

1.4.1.2. Data repositories

Researchers are rapidly aggregating the experimental data needed for WC modeling into repositories (Table 1.2). These include specialized repositories for individual types of data, such as ECMDB [sajed2016ecmdb] and YMDB [ramirez2017ymdb] for metabolite concentrations; ArrayExpress [kolesnikov2015arrayexpress] and the Gene Expression Omnibus (GEO) [clough2016gene] for RNA abundances; PaxDb [wang2015version] for protein abundances; BiGG [king2015bigg] for metabolic reactions; and SABIO-RK [wittig2012sabio] for kinetic parameters, as well as general-purpose repositories such as FigShare [figshare2017], SimTk [simtk2017], and Zenodo [zenodo2017].

Some researchers are making the data in these repositories more accessible by providing common interfaces to multiple repositories such as BioMart [smedley2015biomart], BioServices [cokelaer2013bioservices], and InterMine [kalderimis2014intermine].

Other researchers are making the data in these repositories more accessible by integrating the data into meta-databases. For example, KEGG contains a variety of information about metabolites, proteins, reactions, and pathways [kanehisa2017kegg]; Pathway Commons contains extensive information about protein-protein interactions and pathways [cerami2010pathway]; and UniProt contains a multitude of information about proteins [uniprot2017uniprot].

In addition, some researchers are integrating information about individual organisms into PGDBs such as the BioCyc family of databases [caspi2016metacyc][keseler2017ecocyc]. These databases contain a wide range of information including the stoichiometries of individual reactions, the compositions of individual protein complexes, and the genes regulated by individual transcription factors. Because PGDBs already contain integrated data about a single organism, PGDBs could readily be leveraged to build WC models. In fact, Latendresse et al. developed MetaFlux to build constraint-based models of metabolism from EcoCyc [latendresse2012construction].
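
As a rough illustration of how reaction stoichiometries of the kind stored in a PGDB might become the starting point of a constraint-based model, the following sketch assembles a stoichiometric matrix from a few reaction records and checks which metabolites are balanced for a given flux vector. The reaction identifiers, metabolite names, flux values, and both helper functions are hypothetical; they are not taken from EcoCyc or MetaFlux.

```python
# Hypothetical reaction records, as might be exported from a PGDB.
# Negative coefficients are substrates; positive coefficients are products.
reactions = {
    "uptake": {"A_ext": -1, "A": 1},
    "convert": {"A": -1, "B": 2},
    "export": {"B": -1, "B_ext": 1},
}


def build_stoichiometric_matrix(reactions):
    """Return (metabolites, reaction_ids, S), where S[i][j] is the
    coefficient of metabolite i in reaction j."""
    reaction_ids = sorted(reactions)
    metabolites = sorted({m for stoich in reactions.values() for m in stoich})
    matrix = [
        [reactions[r].get(m, 0) for r in reaction_ids]
        for m in metabolites
    ]
    return metabolites, reaction_ids, matrix


def net_production(matrix, fluxes):
    """Multiply S by a flux vector to get each metabolite's net production rate."""
    return [sum(s * v for s, v in zip(row, fluxes)) for row in matrix]


metabolites, reaction_ids, S = build_stoichiometric_matrix(reactions)

# Assumed fluxes, ordered like reaction_ids: convert, export, uptake.
fluxes = [1.0, 2.0, 1.0]
rates = dict(zip(metabolites, net_production(S, fluxes)))
```

With these assumed fluxes, the intracellular metabolites `A` and `B` have zero net production, which is the steady-state condition that constraint-based methods such as FBA impose, while the external pools `A_ext` and `B_ext` are depleted and accumulated, respectively.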

Furthermore, meta-databases such as Nucleic Acids Research’s Database Summary [galperin201724th] and re3data.org [pampel2013making] contain lists of data repositories.

Most of these repositories have been developed by encouraging individual researchers to deposit their data or by employing curators to manually extract data from publications, supplementary files, and websites. In addition, researchers are beginning to use natural language processing to develop tools for automatically extracting data from publications [cohen2015darpa].

Table 1.2 Repositories that contain experimental data which can be used to build, calibrate, and validate WC models.
Database Content URL Reference
Species structures      
Metabolites      
ChEBI Compound structures https://www.ebi.ac.uk/chebi Hastings et al., 2016
KEGG Compound Compound structures http://www.genome.jp/kegg/compound Kanehisa et al., 2017
KEGG Glycan Glycan structures http://www.genome.jp/kegg/glycan Hashimoto et al., 2006
Metabolomics Workbench Metabolite Database Compound structures http://www.metabolomicsworkbench.org Sud et al., 2016
LIPID MAPS Lipid structures http://www.lipidmaps.org/data/structure Sud et al., 2007
PubChem Compound structures https://pubchem.ncbi.nlm.nih.gov Kim et al., 2016
DNA      
ArrayExpress Functional genomics data including Hi-C data http://www.ebi.ac.uk/arrayexpress Kolesnikov et al., 2015
GenBank DNA sequences https://www.ncbi.nlm.nih.gov/genbank Benson et al., 2017
GEO Functional genomics data including Hi-C data https://www.ncbi.nlm.nih.gov/geo Clough and Barrett, 2016
MethDB Methylation sequencing data http://www.methdb.net Grunau et al., 2001
RNA      
ArrayExpress Functional genomics data including RNA-seq data that encompasses initiation and termination sites http://www.ebi.ac.uk/arrayexpress Kolesnikov et al., 2015
GEO Functional genomics data including RNA-seq data that encompasses initiation and termination sites https://www.ncbi.nlm.nih.gov/geo Clough and Barrett, 2016
MODOMICS Post-transcriptional modifications http://modomics.genesilico.pl Machnicka et al., 2013
RNA Modification Database Post-transcriptional modifications http://mods.rna.albany.edu Cantara et al., 2011
Protein      
3d-footprint 3-dimensional footprints http://floresta.eead.csic.es/3dfootprint Contreras-Moreira, 2010
dbPTM Post-translational modifications http://dbptm.mbc.nctu.edu.tw Huang et al., 2016
PDB 3-dimensional structures http://www.rcsb.org Rose et al., 2017
RESID Post-translational modifications http://pir.georgetown.edu/resid Garavelli, 2004
UniMod Post-translational modifications http://www.unimod.org Creasy and Cottrell, 2004
UniProt Functional protein annotations including post-translational modifications http://www.uniprot.org The UniProt Consortium, 2017
Localization and signal sequences      
RNA      
Fly-FISH RNA localizations http://fly-fish.ccbr.utoronto.ca Wilk et al., 2006
RNALocate RNA localizations http://www.rna-society.org/rnalocate Zhang et al., 2017
Protein      
COMPARTMENTS Protein localizations for Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Mus musculus, and Rattus norvegicus http://compartments.jensenlab.org Binder et al., 2014
Human Protein Reference Database Protein localizations for Homo sapiens http://www.hprd.org Prasad et al., 2009
LOCATE Protein localizations for Homo sapiens and Mus musculus http://locate.imb.uq.edu.au Sprenger et al., 2008
LocDB Protein localizations for Arabidopsis thaliana and Homo sapiens https://www.rostlab.org/services/locDB Rastogi and Rost, 2011
LocSigDB Protein localizations for eukaryotes http://genome.unmc.edu/LocSigDB Negi et al., 2015
OrganelleDB Protein localizations http://labs.mcdb.lsa.umich.edu/organelledb Wiwatwattana et al., 2007
PSORTdb Protein localizations for bacteria and archaea http://db.psort.org Peabody et al., 2016
UniProt Functional protein annotations including protein localizations http://www.uniprot.org The UniProt Consortium, 2017
Concentrations      
Metabolites      
BioNumbers Quantitative measurements of physical, chemical, and biological properties including metabolite concentrations http://bionumbers.hms.harvard.edu Milo et al., 2010
ECMDB Metabolite concentrations in Escherichia coli http://www.ecmdb.ca Sajed et al., 2016
HMDB Metabolite concentrations in Homo sapiens http://www.hmdb.ca Wishart et al., 2013
MetaboLights   https://www.ebi.ac.uk/metabolights Kale et al., 2016
YMDB Metabolite concentrations in Saccharomyces cerevisiae http://www.ymdb.ca Ramirez-Gaona et al., 2017
RNA      
ArrayExpress Functional genomics data including RNA abundances from microarray and RNA-seq experiments http://www.ebi.ac.uk/arrayexpress Kolesnikov et al., 2015
Expression Atlas RNA abundances across organisms and environmental conditions http://www.ebi.ac.uk/gxa Petryszak et al., 2016
GEO Functional genomics data including RNA abundances from microarray and RNA-seq experiments https://www.ncbi.nlm.nih.gov/geo Clough and Barrett, 2016
Proteins      
Review   http://doi.org/10.1002/pmic.201400302 Perez-Riverol et al., 2015
Human Protein Atlas Protein abundances for Homo sapiens http://www.proteinatlas.org Uhlén et al., 2015
PaxDb Protein abundances http://pax-db.org Wang et al., 2015
Plasma Proteome Database Protein abundances for Homo sapiens plasma http://plasmaproteomedatabase.org Nanjappa et al., 2014
PRIDE Mass-spectrometry proteomics data http://www.ebi.ac.uk/pride Vizcaíno et al., 2016
Interactions      
Protein-Metabolite, See also: Cofactors      
Review   http://doi.org/10.1016/j.jchromb.2013.11.043 Matsuda et al., 2014
DrugBank Drugs and their targets https://www.drugbank.ca Law et al., 2014
STITCH Drugs and their targets http://stitch.embl.de Szklarczyk et al., 2016
SuperTarget Drugs and their targets http://insilico.charite.de/supertarget Hecker et al., 2012
Therapeutic Targets Database Drugs and their targets http://bidd.nus.edu.sg/group/cjttd Zhu et al., 2012
Protein-DNA      
ArrayExpress Functional genomics data including ChIP-seq data of protein-DNA interactions http://www.ebi.ac.uk/arrayexpress Kolesnikov et al., 2015
GEO Functional genomics data including ChIP-seq data of protein-DNA interactions https://www.ncbi.nlm.nih.gov/geo Clough and Barrett, 2016
DBD Predicted transcription factors http://www.transcriptionfactor.org Wilson et al., 2008
DBTBS Bacillus subtilis transcription factors and the operons they regulate http://dbtbs.hgc.jp Sierro et al., 2008
ORegAnno Transcription factor binding sites http://www.oreganno.org Lesurf et al., 2016
TRANSFAC Transcription factor binding motifs http://genexplain.com/transfac Matys et al., 2003
UniProbe Transcription factor binding motifs http://thebrain.bwh.harvard.edu/uniprobe Hume et al., 2015
Protein-Protein      
Review   http://doi.org/10.1186/1479-7364-3-3-291 Lehne et al., 2009
ConsensusPathDB Homo sapiens molecular interactions including protein-protein interactions http://cpdb.molgen.mpg.de Kamburov et al., 2013
BioGRID Protein-protein interactions https://thebiogrid.org Chatr-aryamontri et al., 2017
CORUM Protein complex composition http://mips.helmholtz-muenchen.de/corum/  
DIP Protein-protein interactions http://dip.doe-mbi.ucla.edu Salwinski et al., 2004
IntAct Molecular interactions including protein-protein interactions http://www.ebi.ac.uk/intact Kerrien et al., 2012
STRING Protein-protein interactions https://string-db.org Szklarczyk et al., 2017
UniProt Functional protein annotations including protein complex compositions http://www.uniprot.org The UniProt Consortium, 2017
Reactions      
Stoichiometries, catalysis      
BioCyc Reaction stoichiometries and catalysts https://biocyc.org Caspi et al., 2016
KEGG Reaction stoichiometries and catalysts http://www.genome.jp/kegg Kanehisa et al., 2017
MACiE Detailed reaction mechanisms http://www.ebi.ac.uk/thornton-srv/databases/MACiE Holliday et al., 2012
Rhea Reaction stoichiometries http://www.rhea-db.org Morgat et al., 2017
UniProt Reaction stoichiometries and catalysts http://www.uniprot.org The UniProt Consortium, 2017
Cofactors      
CoFactor Organic enzyme cofactors http://www.ebi.ac.uk/thornton-srv/databases/CoFactor Fischer et al., 2010
PDB 3-dimensional protein structures including cofactors http://www.rcsb.org Rose et al., 2017
UniProt Functional protein annotations including cofactors http://www.uniprot.org The UniProt Consortium, 2017
Rate laws and rate constants      
BioNumbers Quantitative measurements of physical, chemical, and biological properties including kinetic parameters http://bionumbers.hms.harvard.edu Milo et al., 2010
BRENDA Kinetic parameters and rate laws http://www.brenda-enzymes.org Schomburg et al., 2017
SABIO-RK Kinetic parameters and rate laws http://sabio.h-its.org Wittig et al., 2012
Pathways      
Metabolic      
Review   https://doi.org/10.1007/s00204-011-0705-2 Karp and Caspi, 2011
BioCyc Species-specific pathways https://biocyc.org Caspi et al., 2016
KEGG PATHWAY Species-specific pathways http://www.genome.jp/kegg/pathway.html Kanehisa et al., 2017
Signaling      
Review   https://doi.org/10.1093/database/bau126 Chowdhury and Sarkar, 2015
hiPathDB Metadatabase of Homo sapiens signaling pathways http://hipathdb.kobic.re.kr Yu et al., 2012
KEGG PATHWAY Pathways including signaling pathways http://www.genome.jp/kegg/pathway.html Kanehisa et al., 2017
NetPath Immune signaling pathways http://www.netpath.org Kandasamy et al., 2010
PANTHER Pathway Pathways including signaling pathways http://www.pantherdb.org/pathway Mi et al., 2017
Pathway Commons Metadatabase of signaling pathways http://www.pathwaycommons.org Cerami et al., 2011
Reactome Pathways including signaling pathways http://www.reactome.org Fabregat et al., 2016
WikiPathways Community curated pathways including signaling pathways http://www.wikipathways.org Kutmon et al., 2016
Meta-databases and meta-database tools      
Review   http://doi.org/10.1002/minf.201600035 Urdidiales-Nieto et al., 2017
BioCatalogue List of web services https://www.biocatalogue.org Bhagat et al., 2010
BioMart Tools for integrating data from multiple repositories http://www.biomart.org Kasprzyk, 2010
BioMoby Ontology-based messaging system for discovering data http://biomoby.open-bio.org BioMoby Consortium et al., 2008
BioServices Python APIs to several popular repositories https://pythonhosted.org/bioservices Cokelaer et al., 2013
BioSWR List of web services http://inb.bsc.es/BioSWR Repchevsky and Gelpi, 2014
ELIXIR Effort to develop a common data infrastructure for Europe https://www.elixir-europe.org Crosswell and Thornton, 2012
NAR Database Summary List of database papers published in Nucleic Acids Research database issues http://www.oxfordjournals.org/nar/database/c Galperin et al., 2017
re3data.org Registry List of data repositories http://www.re3data.org Pampel et al., 2013

1.4.1.3. Prediction tools

Accurate prediction tools can be a useful alternative to constraining models with direct experimental evidence. Currently, many tools can predict molecular properties such as the organization of genes into operons, RNA folds, and protein localizations (Table 1.3). For example, PSORTb can predict the localization of bacterial proteins [yu2010psortb] and TargetScan can predict the mRNA targets of small non-coding RNAs [agarwal2015predicting]. In particular, these tools can be used to impute missing data and extrapolate observations to other organisms, genetic conditions, and environmental conditions. However, many current prediction tools are not sufficiently accurate for WC modeling.

Table 1.3 Computational prediction tools that can generate data which can be used to build, calibrate, and validate WC models.
Tool Prediction(s) Language URL Reference
Metabolites        
Physical properties        
Review Survey of several chemoinformatic packages   http://doi.org/10.1186/1758-2946-3-37 O’Boyle et al., 2011
Chemistry Development Kit (CDK) Java libraries for processing chemical information Java https://cdk.github.io Steinbeck et al., 2006
Cinfony A common API to several cheminformatics toolkits Python http://cinfony.github.io O’Boyle and Hutchison, 2008
Indigo A toolkit for molecular fingerprinting, substructure searching, and visualization C++, Java, .Net, Python http://lifescience.opensource.epam.com/indigo  
JChem Tools for drawing and visualizing molecules and searching chemical databases Java, .Net, REST https://www.chemaxon.com/download/jchem-suite Csizmadia, 2000
Open Babel Tools for searching, converting, analyzing, and storing chemical structures C++, Java, .Net, Python http://openbabel.org O’Boyle et al., 2011
RDKit Cheminformatics toolkit C++, Python http://www.rdkit.org  
Thermodynamics        
UManSysProp Estimates the standard Gibbs free energy of formation of organic molecules using the Joback group contribution method Python, REST http://umansysprop.seaes.manchester.ac.uk Joback and Reid, 1987; Topping et al., 2016
Web GCM Estimates the standard Gibbs free energy of formation of organic molecules using the Mavrovouniotis group contribution method REST http://doi.org/10.1529/biophysj.107.124784 Jankowski et al., 2008
DNA        
Promoters        
Review Review of promoter prediction methods for Homo sapiens   http://doi.org/10.1093/bioinformatics/17.suppl_1.S90 Pedersen et al., 1999
PePPER Predicts prokaryote promoters REST http://pepper.molgenrug.nl/index.php/prokaryote-promoters de Jong et al., 2012
Promoter Predicts vertebrate PolII promoters REST http://www.cbs.dtu.dk/services/Promoter Knudsen, 1999
PromoterHunter Predicts prokaryote promoters REST http://www.phisite.org/promoterhunter Klucar et al., 2010
Genes        
Review Review of several gene prediction software tools   https://cmgm.stanford.edu/biochem218/Projects%202007/Mcelwain.pdf McElwain, 2007
GeneMark Family of tools for predicting viral, prokaryotic, archaeal, and eukaryotic genes Linux executable, REST http://exon.gatech.edu/GeneMark Borodovsky and Lomsadze, 2011
GENSCAN Predicts plant and vertebrate genes Linux executable, REST http://genes.mit.edu/GENSCAN.html Burge and Karlin, 1997
GLIMMER Predicts viral, prokaryotic, and archaeal genes C, REST https://ccb.jhu.edu/software/glimmer Salzberg et al., 1998
Operons        
Review Survey of several operon prediction methods   https://doi.org/10.1093/bib/bbn019 Brouwer et al., 2008
DOOR Predicts prokaryotic operons REST http://csbl.bmb.uga.edu/DOOR Mao et al., 2014
OperonDB Estimates the likelihood that pairs of genes are in the same operon Perl, REST http://operondb.cbcb.umd.edu/cgi-bin/operondb/operons.cgi Ermolaeva et al., 2001
ProOpDB Predicts prokaryotic operons Java, REST http://operons.ibt.unam.mx/OperonPredictor Taboada et al., 2010
VIMSS Predicts prokaryotic and archaeal operons R, REST http://www.microbesonline.org/operons Price et al., 2005
Variant interpretation        
PolyPhen-2 Predicts the functional effects of amino acid substitutions C, REST http://genetics.bwh.harvard.edu/pph2 Adzhubei et al., 2013
PROVEAN Predicts the functional effects of amino acid substitutions and indels C++, REST http://provean.jcvi.org Choi and Chan, 2015
SIFT Predicts the functional effects of amino acid indels C++, REST http://sift.bii.a-star.edu.sg Hu and Ng, 2013
RNA        
Splice sites        
Review Review of methods for predicting splice sites   http://www.umd.be/HSF/Desmet_2010.pdf Desmet et al., 2010
GeneSplicer Predicts eukaryotic splice sites Java https://ccb.jhu.edu/software/genesplicer Pertea et al., 2001
Human Splicing Finder Identifies splicing motifs and predicts the effects of mutations on human splicing REST http://www.umd.be/HSF3/ Desmet et al., 2009
NetGene2 Predicts splice sites in Arabidopsis thaliana, Caenorhabditis elegans, and Homo sapiens REST http://www.cbs.dtu.dk/services/NetGene2 Hebsgaard et al., 1996
NNSplice Predicts splice sites in Drosophila melanogaster and Homo sapiens REST http://www.fruitfly.org/seq_tools/splice.html Reese et al., 1997
Secondary structure        
Review Review of methods for predicting RNA secondary structures   http://doi.org/10.1016/j.ymeth.2016.04.004 Lorenz et al., 2016
Mfold Predicts RNA secondary structures C, REST http://unafold.rna.albany.edu/?q=mfold Zuker, 2003
RNAstructure Predicts RNA and DNA secondary structures C++, Java http://rna.urmc.rochester.edu/RNAstructure.html Reuter and Mathews, 2010
ViennaRNA Predicts RNA secondary structures C, Perl, Python https://www.tbi.univie.ac.at/RNA Lorenz et al., 2011
Open reading frame        
ORF Finder Predicts open reading frames Linux executable, REST https://www.ncbi.nlm.nih.gov/orffinder Rombel et al., 2002
ORF Investigator Predicts open reading frames Windows executable https://sites.google.com/site/dwivediplanet/ORF-Investigator Dhar and Kumar, 2012
ORFPredictor Predicts open reading frames from EST and cDNA sequences Perl, REST http://bioinformatics.ysu.edu/tools/OrfPredictor.html Min et al., 2005
Terminators        
Review Review of prokaryotic transcription termination that cites several methods for predicting terminators.   http://doi.org/10.1016/j.jmb.2011.03.036 Peters et al., 2011
ARNold Predicts prokaryotic rho-independent terminators REST http://rna.igmors.u-psud.fr/toolbox/arnold Gautheret and Lambert, 2001
FindTerm Predicts prokaryotic rho-independent terminators REST http://www.softberry.com/berry.phtml?topic=findterm&group=programs&subgroup=gfindb Solovyev and Salamov, 2011
GeSTer Predicts prokaryotic rho-independent terminators REST http://pallab.serc.iisc.ernet.in/gester Mitra et al., 2011
TransTermHP Predicts prokaryotic rho-independent terminators C++ http://transterm.cbcb.umd.edu Kingsford et al., 2007
Proteins        
Localization        
Review Review of methods for predicting the subcellular localization of prokaryotic and eukaryotic proteins   http://doi.org/10.1002/pmic.201000274 Imai and Nakai, 2010
Review Review of methods for predicting the subcellular localization of prokaryotic proteins   http://doi.org/10.1038/nrmicro1494 Gardy and Brinkman, 2006
Cell-PLoc Predicts the subcellular localization of proteins for multiple species REST http://www.csbio.sjtu.edu.cn/bioinf/Cell-PLoc-2 Chou and Shen, 2010
MultiLoc Predicts the subcellular localization of proteins for multiple species Python, REST http://abi.inf.uni-tuebingen.de/Services/MultiLoc2 Blum et al., 2009
PSORTb Predicts the subcellular localization of prokaryotic and archaeal proteins C++, Perl, REST http://www.psort.org/psortb Yu et al., 2010
SecretomeP Predicts signal peptide-independent protein secretion REST, tcsh http://www.cbs.dtu.dk/services/SecretomeP Bendtsen et al., 2004
WoLF PSORT Predicts the subcellular localization of eukaryotic proteins Perl, REST https://wolfpsort.hgc.jp Horton et al., 2007
Signal sequence        
Review Architecture, function and prediction of long signal peptides   http://doi.org/10.1093/bib/bbp030 Hiss and Schneider, 2009
Phobius Predicts protein transmembrane topology and signal peptides from amino acid sequences Java, REST http://phobius.sbc.su.se Käll et al., 2007
PRED-LIPO Predicts lipoprotein and secretory signal peptides in gram-positive bacteria REST http://bioinformatics.biol.uoa.gr/PRED-LIPO Bagos et al., 2008
PRED-SIGNAL Predicts signal peptides in archaea REST http://bioinformatics.biol.uoa.gr/PRED-SIGNAL Bagos et al., 2009
SignalP Predicts signal peptide cleavage sites in prokaryotic and eukaryotic proteins Perl, REST http://www.cbs.dtu.dk/services/SignalP Petersen et al., 2011
Disulfide bonds        
Review Review of methods predicting disulfide bonds   http://doi.org/10.2174/138920307780831848 Tsai et al., 2007
Review Review of methods predicting disulfide bonds   http://doi.org/10.4137/EBO.S25349 Márquez-Chamorro and Aguilar-Ruiz, 2015
Cyscon A consensus model for predicting disulfide bonds REST http://www.csbio.sjtu.edu.cn/bioinf/Cyscon Yang et al., 2015
DiANNA Predicts disulfide bonds Python, REST http://clavius.bc.edu/~clotelab/DiANNA Ferrè and Clote, 2006
Dinosolve Predicts disulfide bonds REST http://hpcr.cs.odu.edu/dinosolve Yaseen and Li, 2013
DIPro Predicts disulfide bonds REST, Perl http://scratch.proteomics.ics.uci.edu Cheng et al., 2006
DISULFIND Predicts disulfide bonds REST http://disulfind.dsi.unifi.it Ceroni et al., 2006
Complex abundance        
SiComPre Predicts the abundances of Homo sapiens and Saccharomyces cerevisiae protein complexes C++, Java, Python http://www.cosbi.eu/research/prototypes/sicompre Rizzetto et al., 2015
Half-lives        
N-End rule Predicts the half-lives of Escherichia coli, Saccharomyces cerevisiae, and mammalian (rabbit) proteins REST http://web.expasy.org/protparam Bachmair et al., 1986
Interactions        
miRNA targets        
Review Review of methods for predicting miRNA targets   http://doi.org/10.1016/j.cell.2009.01.002 Bartel, 2009
Review Review of methods for predicting miRNA targets   https://doi.org/10.3389/fgene.2014.00023 Peterson et al., 2014
DIANA-microT-CDS Predicts miRNA targets in Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, and Mus musculus REST http://www.microrna.gr/microT-CDS Reczko et al., 2012
miRSearch Predicts miRNA targets in Homo sapiens, Mus musculus, and Rattus norvegicus REST https://www.exiqon.com/miRSearch  
MirTarget Predicts miRNA targets in several animals REST http://mirdb.org Wang, 2016
PITA Predicts miRNA targets in Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, and Mus musculus C, Perl, REST https://genie.weizmann.ac.il/pubs/mir07 Kertesz et al., 2007
STarMir Predicts miRNA targets in Caenorhabditis elegans, Homo sapiens, and Mus musculus Perl, R, REST http://sfold.wadsworth.org/cgi-bin/starmirtest2.pl Liu et al., 2013
TargetScan Predicts miRNA targets in several animals Perl, REST http://www.targetscan.org Agarwal et al., 2015
Protein-DNA binding sites        
Review Review of tools for predicting transcription factor binding sites   http://doi.org/10.1038/nbt1053 Tompa et al., 2005
Review Review of tools for predicting transcription factor binding sites   http://doi.org/10.1186/s12859-016-1298-9 Jayaram et al., 2016
DBD Predicts DNA-binding domains of transcription factors REST http://www.transcriptionfactor.org Wilson et al., 2008
JASPAR Predicts transcription factor binding motifs Perl, Python, R, REST, Ruby http://jaspar.genereg.net Mathelier et al., 2016
Weeder Predicts likely transcription factor binding motifs C++, REST http://doi.org/10.1093/nar/gkh465 Pavesi et al., 2004
Chaperones        
BiPPred Predicts the interactions of mammalian proteins with chaperone BiP REST https://www.bioinformatics.wzw.tum.de/bippred Schneider et al., 2016
cleverSuite Predicts the interactions of Escherichia coli proteins with chaperone DnaK/GroEL REST http://s.tartaglialab.com/clever_suite Klus et al., 2014
LIMBO Predicts the interactions of Escherichia coli proteins with chaperone DnaK REST http://limbo.switchlab.org/limbo-analysis Van Durme et al., 2009
Reaction center and atom mapping        
Review Review of methods for reaction mapping and reaction center detection   http://doi.org/10.1002/wcms.1140 Chen et al., 2013
CAM Predicts the mapping of reactant to product atoms C++ http://www.bioinf.uni-freiburg.de/Software/CAM Mann et al., 2014
CLCA Predicts the mapping of reactant to product atoms REST http://www.maranasgroup.com/metrxn Kumar and Maranas, 2014
MWED Predicts the mapping of reactant to product atoms Lisp http://doi.org/10.1021/ci3002217 Latendresse et al., 2012
ReactionDecoder Predicts the mapping of reactant to product atoms Java https://github.com/asad/ReactionDecoder Rahman et al., 2016
ReactionMap Predicts the mapping of reactant to product atoms REST http://cdb.ics.uci.edu/cgibin/reactionmap/ReactionMapWeb.py Fooshee et al., 2013
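
To make concrete what the simplest sequence-based predictors in Table 1.3 do, the following minimal sketch scans the forward strand of a DNA sequence for open reading frames. It assumes the standard start codon (ATG) and stop codons (TAA, TAG, TGA), ignores the reverse strand and alternative start codons, and is not the algorithm implemented by ORF Finder or any other tool listed above. The function name, the `min_codons` parameter, and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(sequence, min_codons=3):
    """Return sorted (start, end) index pairs of forward-strand ORFs.

    Each ORF begins at an ATG and ends just after the first in-frame stop
    codon. `min_codons` counts the start and stop codons.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end))
                start = None  # resume searching after the stop codon
    return sorted(orfs)


example = "CCATGAAATGGTAGCATGTAA"
orfs = find_orfs(example)
```

Real gene finders combine this kind of scan with statistical models of coding sequence, which is one reason their accuracy varies across organisms.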

1.4.2. Modeling and simulation tools

Here, we review several advances in modeling and simulation technology that we believe are beginning to enable researchers to aggregate and organize the data needed for WC modeling and to design, describe, simulate, calibrate, verify, and analyze WC models.

1.4.2.1. Data aggregation and organization tools

To make the large amount of publicly available data usable for modeling, researchers are developing tools such as BioServices [cokelaer2013bioservices] for programmatically accessing repositories and using PGDBs to organize the data needed for modeling. PGDBs are well-suited to organizing the data needed for WC models because they support structured representations of metabolites, DNA, RNA, proteins, and their interactions. However, traditional PGDBs provide limited support for non-metabolic pathways and quantitative data. Consequently, we are developing WholeCellKB, a PGDB specifically designed for WC modeling [karr2013wholecellkb].
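
Much of the work in programmatic access is turning repository-specific text formats into structured records. The sketch below parses a single record laid out in the style of a KEGG flat file, where the field name occupies the first 12 columns and continuation lines are indented. The sample record is an abbreviated, hand-written illustration rather than a verbatim download, and `parse_flat_record` is a hypothetical helper, not part of BioServices.

```python
# Abbreviated, illustrative record in a KEGG-style flat-file layout: a field
# name in the first 12 columns, with continuation lines indented by 12 spaces.
SAMPLE_RECORD = """\
ENTRY       C00031                      Compound
NAME        D-Glucose;
            Grape sugar;
            Dextrose
FORMULA     C6H12O6
///
"""


def parse_flat_record(text):
    """Parse one flat-file record into {field: [values]}."""
    record = {}
    field = None
    for line in text.splitlines():
        if line.startswith("///"):  # end-of-record marker
            break
        key, value = line[:12].strip(), line[12:].strip()
        if key:  # a non-empty key starts a new field
            field = key
            record[field] = []
        if field is not None and value:
            record[field].append(value.rstrip(";"))
    return record


record = parse_flat_record(SAMPLE_RECORD)
compound_id = record["ENTRY"][0].split()[0]
```

In practice, a client library would fetch the record over the network and handle many more fields; the point here is only that each repository's format must be mapped onto a common structure before the data can be organized for modeling.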

1.4.2.2. Model design tools

Several software tools have been developed for designing models of individual cellular pathways (Table 1.4). These include BioUML [kolpakov2011biouml], CellDesigner [matsuoka2014modeling], COPASI [bergmann2017copasi], JDesigner [sauro2003next], and Virtual Cell [resasco2012virtual], which support dynamical modeling; RuleBender [smith2012rulebender], which supports rule-based modeling; and COBRApy [ebrahim2013cobrapy], FAME [boele2012fame], and RAVEN [agren2013raven], which support constraint-based metabolic modeling.

Recently, researchers have developed several tools that support some of the features needed for WC modeling. This includes SEEK which helps researchers design models from data tables [wolstencroft2015seek], Virtual Cell which helps researchers design models from KEGG pathways [resasco2012virtual][kanehisa2017kegg], MetaFlux which helps researchers design metabolic models from PGDBs [latendresse2012construction], the Cell Collective [helikar2013cell] and JWS Online [du2013jws] which help researchers build models collaboratively, PySB which helps researchers design models programmatically [lopez2013programming], and semanticSBML [krause2009annotation] and SemGen [neal2014reappraisal] which help researchers merge models.

Table 1.4 Software tools that can be used to help build, calibrate, validate, simulate, visualize, and analyze WC models.
Tool URL Reference
Data aggregation tools
BioCatalogue https://www.biocatalogue.org Bhagat et al., 2010
BioServices https://pythonhosted.org/bioservices Cokelaer et al., 2013
Data organization tools
GMOD http://gmod.org Papanicolaou and Heckel, 2010
Pathway Tools http://brg.ai.sri.com/ptools Karp et al., 2016
WholeCellKB http://www.wholecellkb.org Karr et al., 2013
Model design tools
CellDesigner http://www.celldesigner.org Matsuoka et al., 2014
COPASI http://copasi.org Mendes et al., 2009
JWS Online http://jjj.biochem.sun.ac.za Olivier and Snoep, 2004
MetaFlux http://brg.ai.sri.com/ptools Latendresse et al., 2012
PhysioDesigner http://www.physiodesigner.org Asai et al., 2012
RAVEN http://biomet-toolbox.org/index.php?page=downtools-raven Agren et al., 2013
RuleBender http://bionetgen.org/index.php/Quick_Start Smith et al., 2012
VirtualCell http://vcell.org Schaff et al., 2016
Model testing and verification tools
biolab http://www.lehman.edu/academics/cmacs/bio-lab.php Clarke et al., 2008
MEMOTE https://memote.readthedocs.io  
SBML-to-PRISM http://www.prismmodelchecker.org/sbml  
Model description languages
BioNetGen http://bionetgen.org Harris et al., 2016
BioPAX http://www.biopax.org Demir et al., 2010
CellML https://www.cellml.org Cuellar et al., 2015
Kappa http://dev.executableknowledge.org Wilson-Kanamori et al., 2015
ML-Rules http://jamesii.informatik.uni-rostock.de/jamesii.org/ Maus et al., 2011
PySB http://pysb.org/ Lopez et al., 2013
SBML http://sbml.org Hucka et al., 2015
Simulation description languages
SED-ML http://sed-ml.org Waltemath et al., 2011
SESSL http://sessl.org Ewald and Uhrmacher, 2014
Simulators
cobrapy http://opencobra.github.io/cobrapy Ebrahim et al., 2013
COPASI http://copasi.org Mendes et al., 2009
ECell http://www.e-cell.org Takahashi et al., 2003
Lattice Microbes http://www.scs.illinois.edu/schulten/lm Hallock et al., 2014
libRoadRunner http://libroadrunner.org Somogyi et al., 2015
NFSim http://michaelsneddon.net/nfsim Sneddon et al., 2011
VirtualCell http://vcell.org Schaff et al., 2016
Simulation result formats
HDF5 https://support.hdfgroup.org/HDF5 Folk et al., 2011
NuML https://github.com/numl/numl Dada et al., 2017
SBRML http://www.comp-sys-bio.org/SBRML.html Dada et al., 2010
Simulation result databases
Bookshelf http://sbcb.bioch.ox.ac.uk/bookshelf Vohra et al., 2010
Dynameomics http://www.dynameomics.org van der Kamp et al., 2010
SEEK https://fair-dom.org/platform/seek Wolstencroft et al., 2011
WholeCellSimDB http://www.wholecellsimdb.org Karr et al., 2014
Visualization tools
Vega https://vega.github.io Satyanarayan et al., 2017
The Visualization Toolkit (VTK) http://www.vtk.org Hanwell et al., 2015
WholeCellViz http://www.wholecellviz.org Lee et al., 2013
Workflow management tools
Galaxy https://usegalaxy.org Walker et al., 2016
Taverna http://www.taverna.org.uk Wolstencroft et al., 2013
VisTrails https://www.vistrails.org Freire and Silva, 2012

However, none of these tools is well-suited to WC modeling because none supports all of the features that WC modeling requires, including programmatically designing models from large data sources such as PGDBs; collaboratively designing models over a web-based interface; designing composite, multi-algorithmic models; representing models in terms of rule patterns; and recording the data sources and assumptions used to build models.

1.4.2.3. Model selection tools

Several methods have also been developed to help researchers select among multiple potential models, including likelihood-based, Bayesian, and heuristic methods [kirk2013model]. ABC-SysBio [liepe2014framework][toni2009approximate], ModelMage [flottmann2008modelmage], and SYSBIONS [johnson2014sysbions] are some of the most advanced model selection tools. However, these tools only support deterministic dynamical models.
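
As a small illustration of the likelihood-based family of methods, the sketch below compares two candidate models with the Akaike information criterion (AIC), which penalizes the maximized log-likelihood by the number of fitted parameters. AIC is used here only as a familiar example; it is not claimed to be the criterion implemented by ABC-SysBio, ModelMage, or SYSBIONS. The measurements, predictions, model names, and parameter counts are hypothetical, and the likelihood assumes independent Gaussian noise.

```python
import math


def gaussian_log_likelihood(observed, predicted):
    """Maximized log-likelihood for i.i.d. Gaussian residuals, with the
    noise variance estimated from the residuals themselves."""
    n = len(observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sigma2 = rss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(log_likelihood, n_parameters):
    """Akaike information criterion: lower values are preferred."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical measurements and the fitted predictions of two candidate models.
observed = [1.0, 2.1, 2.9, 4.2, 5.0]
candidates = {
    # name: (fitted predictions, number of fitted parameters incl. noise variance)
    "simple": ([1.0, 2.0, 3.0, 4.0, 5.0], 2),
    "complex": ([1.1, 2.1, 3.0, 4.1, 4.9], 4),
}

scores = {
    name: aic(gaussian_log_likelihood(observed, predicted), k)
    for name, (predicted, k) in candidates.items()
}
best = min(scores, key=scores.get)
```

In this toy comparison the simpler model is preferred: the complex model fits slightly better, but not by enough to offset the penalty for its two additional parameters.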

1.4.2.4. Model refinement tools

Several tools have been developed for refining models, including using physiological data to identify molecular gaps in metabolic models and using databases of molecular mechanisms to fill molecular gaps in metabolic models [orth2010systematizing][blais2013linking]. GapFind uses mixed integer linear programming to identify all of the metabolites that cannot be both produced and consumed in metabolic models, one type of molecular gap in metabolic models [kumar2007optimization]. GapFill [kumar2007optimization], OMNI [herrgaard2006identification], and SMILEY [reed2006systems] use linear programming to identify the most parsimonious set of reactions from reaction databases such as KEGG [kanehisa2017kegg] to fill molecular gaps in metabolic models. FastGapFill is one of the most efficient of these gap filling tools [latendresse2014efficiently]. GrowMatch extends gap filling to find the most parsimonious set of reactions that not only fill molecular gaps in metabolic models, but also correct erroneous gene essentiality predictions [kumar2009growmatch]. ADOMETA [kharchenko2006identifying], GAUGE [hosseini2017discovering], likelihood-based gap filling [benedict2014likelihood], MIRAGE [vitkin2012mirage], PathoLogic [green2004bayesian] and SEED [osterman2006hidden] extend gap filling further by using sequence homology and other genomic data to identify the genes which most likely catalyze missing reactions in metabolic networks. However, these tools are only applicable to metabolic models.
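
The kind of molecular gap that GapFind targets can be illustrated with a much simpler structural check. The sketch below flags metabolites that no reaction can produce or that no reaction can consume. It is a simplification and not the mixed integer linear program used by GapFind, because it ignores whether the producing and consuming reactions can actually carry flux. The toy network and all names are hypothetical.

```python
def find_dead_end_metabolites(reactions):
    """Return metabolites that are never produced or never consumed.

    `reactions` maps a reaction ID to (stoichiometry, reversible), where
    negative coefficients denote substrates and positive ones products.
    """
    produced, consumed = set(), set()
    for stoichiometry, reversible in reactions.values():
        for metabolite, coefficient in stoichiometry.items():
            if coefficient > 0 or reversible:
                produced.add(metabolite)
            if coefficient < 0 or reversible:
                consumed.add(metabolite)
    metabolites = produced | consumed
    return {
        "not_produced": sorted(metabolites - produced),
        "not_consumed": sorted(metabolites - consumed),
    }


# Hypothetical toy network.
toy_network = {
    "uptake_glc": ({"glc": 1}, False),             # exchange: -> glc
    "glycolysis": ({"glc": -1, "pyr": 2}, False),  # glc -> 2 pyr
    "ldh": ({"pyr": -1, "lac": 1}, True),          # pyr <-> lac
    "orphan": ({"x": -1, "pyr": 1}, False),        # x -> pyr, but nothing makes x
}

gaps = find_dead_end_metabolites(toy_network)
print(gaps)
```

A gap-filling method would then search a reaction database for the smallest set of reactions that supplies the missing metabolite.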

1.4.2.5. Model formats

Several formats have been developed to represent cell models including formats such as CellML [garny2008cellml] that represent models as collections of variables and equations, formats such as SBML [hucka2003systems] that represent models as collections of species and reactions, and more abstract formats such as BioNetGen [harris2016bionetgen], Kappa [danos2004formal], and ML-Rules [maus2011rule] that represent models as collections of species and rule patterns.

The Systems Biology Markup Language (SBML) was developed in 2002 to represent dynamical models that can be simulated by integrating ordinary differential equations or using the stochastic simulation algorithm, as well as the semantic biological meaning of models. Recently, SBML has been extended to support a wide range of models through the development of several new packages. The flux balance constraints package supports constraint-based models, the qualitative models package supports logical models, the spatial processes package supports spatial models that can be simulated by integrating PDEs, the multistate multicomponent species package supports rule-based model descriptions, and the hierarchical model composition package supports composite models. SBML is by far the most widely supported and commonly used format for representing cell models. For example, SBML is supported by COPASI [bergmann2017copasi], the most commonly used cell modeling software program, and BioModels, the most commonly used cell model repository [chelliah2015biomodels]. However, SBML creates verbose model descriptions, the multistate multicomponent species package only supports a few types of combinatorial complexity, SBML does not directly support multi-algorithmic models, and SBML cannot represent model provenance including the data sources and assumptions used to build models [waltemath2016toward].

More recently, Faeder and others have developed BioNetGen [harris2016bionetgen] and other rule-based formats to efficiently describe the combinatorial complexity of protein-protein interactions. These formats enable researchers to describe models in terms of species and reaction patterns which can be evaluated to generate all of the individual species and reactions in a model. This abstraction helps researchers describe reactions directly in terms of their chemistry, describe large models concisely, and avoid errors in enumerating species and reactions. Models that are described in rule-based formats such as BioNetGen can be simulated either by enumerating all of the possible species and reactions and then simulating the expanded model via conventional deterministic or stochastic dynamical simulation methods, or via network-free simulation which iteratively discovers individual species and reactions during simulation [sneddon2011efficient]. BioNetGen is the most commonly used rule-based modeling format and NFsim is the most commonly used network-free simulator. However, BioNetGen only supports a few types of combinatorial complexity, BioNetGen does not support composite or multi-algorithmic models, BioNetGen cannot represent the semantic biological meaning of models, and BioNetGen cannot represent model provenance.
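
To make the idea of evaluating a rule pattern concrete, the sketch below expands a single hypothetical rule ("any unphosphorylated site can be phosphorylated, independently of the other sites") into the full list of species and reactions that it implies. It is plain Python rather than BioNetGen syntax, and the string encoding of site states is an assumption made for illustration.

```python
from itertools import product


def expand_phosphorylation_rule(n_sites):
    """Enumerate the species and reactions implied by one phosphorylation rule.

    Each species is a string of site states: 'U' (unphosphorylated) or
    'P' (phosphorylated). Each reaction is a (substrate, product) pair.
    """
    species = ["".join(state) for state in product("UP", repeat=n_sites)]
    reactions = []
    for substrate in species:
        for i, site in enumerate(substrate):
            if site == "U":
                product_species = substrate[:i] + "P" + substrate[i + 1:]
                reactions.append((substrate, product_species))
    return species, reactions


species, reactions = expand_phosphorylation_rule(3)
print(len(species), len(reactions))
```

One rule for a protein with n sites expands to 2^n species and n * 2^(n-1) reactions, which is why writing the rule is more concise and less error-prone than enumerating the network by hand.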

1.4.2.6. Simulation algorithms

Several algorithms have been developed to simulate cells with a wide range of granularity including algorithms for integrating systems of ODEs and PDEs, stochastic simulation algorithms, algorithms for simulating logical networks and Petri nets, and hybrid algorithms for co-simulating models that are composed of mathematically-dissimilar submodels.

The most commonly used algorithms to simulate cell models include algorithms for integrating systems of ODEs. These algorithms are best suited to simulating well-characterized and well-mixed systems that involve large concentrations that are robust to stochastic fluctuations. These algorithms are poorly suited to simulating stochastic processes that involve small concentrations, as well as poorly characterized pathways with little kinetic data. Consequently, ODE integration algorithms are poorly suited for WC modeling.
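
As a minimal sketch of ODE integration, the following code integrates a hypothetical reversible isomerization A <-> B with a fixed-step fourth-order Runge-Kutta method. The rate constants, step size, and function names are assumptions chosen for illustration; production simulators typically use adaptive, stiffness-aware integrators instead.

```python
def rk4_step(f, y, t, dt):
    """Advance the state y by one classical fourth-order Runge-Kutta step."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + dt, [yi + dt * ki for yi, ki in zip(y, k3)])
    return [
        yi + dt / 6 * (a + 2 * b + 2 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


def reversible_isomerization(t, y, k_f=2.0, k_r=1.0):
    """Mass-action rate equations for A <-> B (hypothetical rate constants)."""
    a, b = y
    flux = k_f * a - k_r * b
    return [-flux, flux]


def integrate(f, y0, t_end, dt):
    """Integrate dy/dt = f(t, y) from t = 0 to t_end with a fixed step."""
    y, t = list(y0), 0.0
    for _ in range(round(t_end / dt)):
        y = rk4_step(f, y, t, dt)
        t += dt
    return y


a_final, b_final = integrate(reversible_isomerization, [1.0, 0.0], t_end=10.0, dt=0.01)
print(a_final, b_final)
```

The concentrations relax to the equilibrium ratio set by the two rate constants. The result is a single smooth trajectory, so it cannot capture the fluctuations that matter when copy numbers are small.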

Stochastic simulation algorithms, including the Stochastic Simulation Algorithm (SSA), which is also known as Gillespie’s algorithm [gillespie1977exact], newer, more efficient implementations of SSA such as the Gibson-Bruck method and RSSA-CR [thanh2017efficient], and approximations of SSA such as tau leaping, are commonly used to simulate pathways that involve small concentrations that are susceptible to stochastic variation. However, these algorithms are only suitable for dynamical models which require substantial kinetic data, they are computationally expensive, especially for models that include reactions that have high fluxes, and they are limited to models with small state spaces. Consequently, stochastic simulation algorithms are poorly suited for simulating WC models.
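
The direct method of SSA can be sketched in a few lines for a hypothetical birth-death process (synthesis at a constant rate, first-order degradation). The rate constants, seed, and function name are assumptions for illustration.

```python
import random


def gillespie_birth_death(k_syn, k_deg, x0, t_end, rng):
    """Direct-method SSA for 0 -> X (rate k_syn) and X -> 0 (rate k_deg * X)."""
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        propensities = [k_syn, k_deg * x]
        total = sum(propensities)
        if total == 0:
            break  # no reaction can fire
        # The waiting time to the next event is exponentially distributed.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Choose which reaction fires in proportion to its propensity.
        if rng.random() * total < propensities[0]:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


times, counts = gillespie_birth_death(
    k_syn=10.0, k_deg=0.1, x0=0, t_end=100.0, rng=random.Random(0)
)
print(len(times), counts[-1])
```

Every reaction event is simulated individually, so the run time grows with the total flux through the system. This is the cost noted above for models with high-flux reactions.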

Network-free simulation algorithms are stochastic simulation algorithms for efficiently simulating rule-based models without enumerating every possible species and reaction prior to simulation and instead discovering the active species and reactions during simulation. Unlike traditional stochastic simulation algorithms, network-free simulation algorithms can represent large models that have combinatorially large or even infinite state spaces. Otherwise, network-free stochastic simulation algorithms have the same limitations as other stochastic simulation algorithms.
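
The following toy sketch conveys the network-free idea under strong simplifying assumptions: a single hypothetical phosphorylation rule is applied directly to individual molecule objects, so the 2^20 possible species of a 20-site protein are never enumerated and only the species that actually arise are recorded. It is not the NFsim algorithm, and all names and parameter values are invented for illustration.

```python
import random


def network_free_phosphorylation(n_molecules, n_sites, k, t_end, rng):
    """Apply the rule 'a free site becomes phosphorylated at rate k' to molecules."""
    molecules = [[False] * n_sites for _ in range(n_molecules)]
    # Every (molecule, site) pair that currently matches the rule's pattern.
    free_sites = [(m, s) for m in range(n_molecules) for s in range(n_sites)]
    observed = {tuple(molecules[0])}  # species discovered so far
    t = 0.0
    while free_sites:
        # Total propensity of the rule = k * number of matching sites.
        t += rng.expovariate(k * len(free_sites))
        if t > t_end:
            break
        # Pick one match uniformly at random and remove it in O(1).
        i = rng.randrange(len(free_sites))
        free_sites[i], free_sites[-1] = free_sites[-1], free_sites[i]
        m, s = free_sites.pop()
        molecules[m][s] = True
        observed.add(tuple(molecules[m]))
    return molecules, observed


molecules, observed = network_free_phosphorylation(
    n_molecules=50, n_sites=20, k=0.05, t_end=10.0, rng=random.Random(1)
)
print(len(observed), "species observed out of", 2 ** 20, "possible")
```

The number of species observed is bounded by the number of events simulated rather than by the size of the state space, which is what allows this style of simulation to handle combinatorially large models.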

FBA is the second-most commonly used algorithm for simulating cell models. FBA predicts the steady-state flux of each metabolic reaction using detailed information about the stoichiometry and catalysis of each reaction, a small amount of quantitative data about the chemical composition of cells, a small amount of data about the exchange rate of each extracellular nutrient, and the assumption that metabolism has evolved to maximize the rate of cellular growth. However, FBA has limited ability to predict metabolite concentrations and temporal dynamics, and its assumptions are largely only applicable to microbial metabolism. Consequently, FBA is not well-suited to simulating entire cells.
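
The optimization problem that FBA solves can be written compactly as a linear program. The notation below is assumed here for illustration and does not come from the text above: \(S\) is the stoichiometric matrix (metabolites by reactions), \(v\) is the vector of reaction fluxes, \(c\) is a vector that selects the growth (biomass) reaction, and \(v^{\min}\) and \(v^{\max}\) are flux bounds that encode, among other things, the measured nutrient exchange rates.

```latex
\begin{aligned}
\max_{v} \quad & c^{\mathsf{T}} v \\
\text{subject to} \quad & S v = 0, \\
& v^{\min} \le v \le v^{\max}
\end{aligned}
```

The constraint \(S v = 0\) expresses the steady-state assumption. It is why FBA requires no kinetic parameters and also why it does not directly predict metabolite concentrations or temporal dynamics.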

Logical simulation algorithms are frequently used for coarse-grained simulations of transcriptional regulation and other pathways for which we have limited kinetic data. Logical simulations are computationally efficient because they are coarse-grained. However, logical simulation algorithms are poorly suited to WC modeling because they cannot generate detailed quantitative predictions, and therefore have limited utility for medicine and bioengineering.
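
A logical simulation can be sketched as the repeated, synchronous evaluation of Boolean update rules. The three-gene negative feedback loop below (A activates B, B activates C, C represses A) is a hypothetical circuit chosen for illustration.

```python
def simulate_boolean_network(rules, initial_state, n_steps):
    """Synchronously update every node from the previous state at each step."""
    state = dict(initial_state)
    trajectory = [state]
    for _ in range(n_steps):
        state = {node: rule(state) for node, rule in rules.items()}
        trajectory.append(state)
    return trajectory


# Hypothetical negative feedback loop.
rules = {
    "A": lambda s: not s["C"],  # C represses A
    "B": lambda s: s["A"],      # A activates B
    "C": lambda s: s["B"],      # B activates C
}

trajectory = simulate_boolean_network(
    rules, {"A": True, "B": False, "C": False}, n_steps=6
)
for state in trajectory:
    print(state)
```

The circuit cycles through six states and returns to its initial state. The simulation captures this qualitative oscillation without any kinetic parameters, but it cannot predict the period in physical time or the expression levels.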

Multi-algorithmic simulations are ideal for WC modeling because they can simulate models that include fine-grained representations of well-characterized pathways, as well as coarse-grained representations of poorly-characterized pathways. Takahashi et al. developed one of the first algorithms for co-simulating multiple mathematically-dissimilar submodels [takahashi2004multi]. However, their algorithm is not well-suited to WC modeling because it does not support FBA or network-free simulation. Recently, we and others developed a multi-algorithm simulation meta-algorithm which supports ODE integration, conventional stochastic simulation, network-free stochastic simulation, FBA, and logical simulation [karr2012whole]. However, our algorithm violates the arrow of time and is not scalable to large models.
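
The general idea of co-simulating mathematically-dissimilar submodels can be sketched with a naive fixed-step scheme in which each submodel advances a shared state in turn. This is an illustrative simplification and not the algorithm of Takahashi et al. or of [karr2012whole]; the two submodels, their rate constants, and all names are hypothetical.

```python
import random


def ode_submodel(state, dt, uptake_rate=5.0):
    """Deterministic submodel: constant nutrient uptake (one forward Euler step)."""
    state["nutrient"] += uptake_rate * dt


def ssa_submodel(state, dt, rng, k_syn=0.1):
    """Stochastic submodel: nutrient -> protein, simulated by SSA within the step."""
    t = 0.0
    while state["nutrient"] >= 1:
        t += rng.expovariate(k_syn * state["nutrient"])
        if t > dt:
            break
        state["nutrient"] -= 1
        state["protein"] += 1


def co_simulate(t_end, dt, rng):
    """Advance both submodels sequentially over shared state in fixed time steps."""
    state = {"nutrient": 10.0, "protein": 0}
    for _ in range(round(t_end / dt)):
        ode_submodel(state, dt)
        ssa_submodel(state, dt, rng)
    return state


final_state = co_simulate(t_end=10.0, dt=0.1, rng=random.Random(2))
print(final_state)
```

In this sketch the submodels only exchange information at step boundaries, so the accuracy depends on the step size. A scheme of this kind also serializes the submodels, which is one reason scalable multi-algorithmic simulation remains an open problem.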

1.4.2.7. Simulation experiment formats

The Minimum Information About a Simulation Experiment (MIASE) guidelines have been developed to establish the minimum metadata that should be provided about a simulation experiment to enable other researchers to reproduce and understand the simulation [waltemath2011minimum]. The Simulation Experiment Description Markup Language (SED-ML) [waltemath2011reproducible] and the Simulation Experiment Specification via a Scala Layer (SESSL) [ewald2014sessl] formats have been developed to represent simulation experiments. Both formats are capable of representing all of the model parameters and simulator arguments needed to simulate a model. However, both formats are limited to a small range of model formats and simulators. SED-ML is limited to models that are represented using XML-based formats such as SBML, and SESSL is currently limited to Java-based simulators. Consequently, neither is currently well-suited to WC modeling.

1.4.2.8. Simulation tools

Numerous tools have been developed to simulate cell models, including BioUML [kolpakov2011biouml], Cell Collective [helikar2013cell], COBRApy [ebrahim2013cobrapy], COPASI [bergmann2017copasi], E-Cell [dhar2006cell], FAME [boele2012fame], iBioSim [myers2009ibiosim], libRoadRunner [somogyi2015libroadrunner], JWS Online [du2013jws], NFsim [sneddon2011efficient], RAVEN [agren2013raven], and Virtual Cell [resasco2012virtual].

COPASI is the most commonly used simulation tool. COPASI supports several deterministic, stochastic, and hybrid deterministic/stochastic simulation algorithms. However, COPASI does not support network-free stochastic simulation, FBA, logical, or multi-algorithmic simulation and COPASI does not support high-performance parallel simulation of large models.

Virtual Cell supports several deterministic, stochastic, hybrid deterministic/stochastic, network-free, and spatial simulation algorithms. However, Virtual Cell does not support FBA or multi-algorithmic simulations and Virtual Cell does not support high-performance parallel simulation of large models.

COBRApy, FAME, and RAVEN support FBA of metabolic models. However, these packages provide no support for other types of models.

E-Cell is one of the only simulation programs that supports multi-algorithmic simulation. However, E-Cell does not support FBA or rule-based simulation, and E-Cell does not scale well to large models.

Several tools including cupSODA [nobile2013cupsoda], cuTauLeaping [nobile2014cutauleaping], and Rensselaer’s Optimistic Simulation System (ROSS) [carothers2002ross] have been developed to simulate models in parallel. However, cupSODA only supports deterministic simulation, cuTauLeaping only supports network-based stochastic simulation, cupSODA and cuTauLeaping only support GPUs, and ROSS is a low-level, general-purpose framework for distributed CPU simulation.

1.4.2.9. Calibration tools

Accurate parameter values are essential for reliable predictions. Many methods have been developed to calibrate models by numerically optimizing the values of their parameters, including derivative-based initial value methods and stochastic multiple shooting methods [banga2008parameter].

Several complementary methods have also been developed to optimize computationally-expensive, high-dimensional functions, including surrogate modeling, distributed optimization, and automatic differentiation. Surrogate modeling, which is also referred to as function approximation, metamodeling, response surface modeling, and model emulation, promises to reduce the computational cost of numerical optimization by optimizing a computationally cheaper model which approximates the original model [forrester2009recent][wang2014evaluation][halloran2011adaptive][jones2001taxonomy]. Surrogate modeling has been used in several fields including aerospace engineering [ong2003evolutionary], hydrology [razavi2012numerical], and petroleum engineering [queipo2002surrogate]. However, further work is needed to develop methods for efficiently generating reduced surrogate WC models.
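
A minimal sketch of surrogate modeling is shown below: a hypothetical "expensive" objective is sampled at three parameter values, a quadratic surrogate is fitted through the samples, and the surrogate is minimized analytically instead of the original function. The objective is a stand-in that happens to be exactly quadratic, so the surrogate is exact here; for a real model the surrogate would be approximate and would normally be refined iteratively.

```python
def fit_quadratic(points):
    """Coefficients (a, b, c) of a*x**2 + b*x + c through three (x, y) points."""
    (x0, y0), (x1, y1), (x2, y2) = points
    # Lagrange interpolation weights.
    d0 = y0 / ((x0 - x1) * (x0 - x2))
    d1 = y1 / ((x1 - x0) * (x1 - x2))
    d2 = y2 / ((x2 - x0) * (x2 - x1))
    a = d0 + d1 + d2
    b = -(d0 * (x1 + x2) + d1 * (x0 + x2) + d2 * (x0 + x1))
    c = d0 * x1 * x2 + d1 * x0 * x2 + d2 * x0 * x1
    return a, b, c


def expensive_model_error(k):
    """Hypothetical stand-in for the misfit of a costly simulation at parameter k."""
    return (k - 2.0) ** 2 + 1.0


# Evaluate the expensive function only three times.
samples = [(k, expensive_model_error(k)) for k in (0.0, 1.0, 4.0)]
a, b, c = fit_quadratic(samples)

# Minimize the cheap surrogate analytically (vertex of the parabola).
k_best = -b / (2 * a)
print(k_best)
```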

Distributed optimization is also a promising approach for optimizing computationally expensive functions. Distributed optimization uses multiple agents, each simultaneously employing the same algorithm on different regions, to quickly identify optima [panait2005cooperative][palomar2010convex]. Furthermore, agents can cooperate by exchanging information. Distributed optimization has been used in several fields including aerospace and electrical engineering [raffard2004distributed][rabbat2004distributed] and molecular dynamics [chen2006geometric].
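
The basic pattern of distributed optimization, in which several agents apply the same algorithm to different regions, can be sketched as follows. Each agent runs a golden-section search on its own interval and the best result is kept. The objective function and regions are hypothetical, the agents do not exchange information, and threads are used only to keep the sketch self-contained; a real application would distribute the agents across processes or machines.

```python
import math
from concurrent.futures import ThreadPoolExecutor

INV_PHI = (math.sqrt(5) - 1) / 2


def golden_section_minimize(f, lo, hi, tol=1e-6):
    """Local search on [lo, hi]; returns (x, f(x)) at the located minimum."""
    while hi - lo > tol:
        x1 = hi - INV_PHI * (hi - lo)
        x2 = lo + INV_PHI * (hi - lo)
        if f(x1) < f(x2):
            hi = x2
        else:
            lo = x1
    x = (lo + hi) / 2
    return x, f(x)


def distributed_minimize(f, regions):
    """Run one search agent per region concurrently and keep the best result."""
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        results = list(pool.map(lambda r: golden_section_minimize(f, *r), regions))
    return min(results, key=lambda result: result[1])


def objective(x):
    """Hypothetical objective with two local minima."""
    return x ** 4 - 3 * x ** 2 + x


regions = [(-2.0, -1.0), (-1.0, 0.0), (0.0, 1.0), (1.0, 2.0)]
best_x, best_value = distributed_minimize(objective, regions)
print(best_x, best_value)
```

A single local search started in the rightmost region would stop at the shallower minimum. Partitioning the search space lets one of the agents find the deeper one.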

Another promising approach for optimizing computationally expensive functions is automatic differentiation. Automatic differentiation is an efficient technique for analytically computing the derivative of a function [rall1981automatic]. Automatic differentiation can be used to make derivative-based optimization methods tractable in cases where finite difference calculations are prohibitively expensive. Automatic differentiation has been used to identify parameters in chemical engineering [ramachandran2010effective], biomechanics [bucker2006automatic], and physiology [schumann2013nonlinear].
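
Forward-mode automatic differentiation can be sketched with dual numbers, which carry a value together with its derivative through each arithmetic operation. The minimal `Dual` class below supports only addition, multiplication, and division, and the Michaelis-Menten example with its parameter values is hypothetical.

```python
class Dual:
    """A number that carries its value and its derivative with respect to one input."""

    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        return Dual(
            self.value * other.value,
            self.deriv * other.value + self.value * other.deriv,  # product rule
        )

    __rmul__ = __mul__

    def __truediv__(self, other):
        other = self._lift(other)
        return Dual(
            self.value / other.value,
            (self.deriv * other.value - self.value * other.deriv) / other.value ** 2,
        )

    def __rtruediv__(self, other):
        return self._lift(other) / self


def michaelis_menten(v_max, k_m, s):
    """Reaction rate v = v_max * s / (k_m + s)."""
    return v_max * s / (k_m + s)


# Seed the derivative of k_m with 1 to obtain dv/dk_m alongside v.
rate = michaelis_menten(v_max=10.0, k_m=Dual(3.0, 1.0), s=2.0)
print(rate.value, rate.deriv)
```

The derivative is exact to floating-point precision and requires a single evaluation of the function, whereas a finite difference approximation requires at least one extra evaluation per parameter and introduces truncation error.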

Several software tools have also been developed for calibrating cell models [chis2011structural][ashyraliyev2009systems][chou2009recent][sun2012parameter][moles2003parameter]. Some of the most advanced model calibration tools include DAISY which can evaluate the identifiability of a model [bellu2007daisy], ABC-SysBio which uses approximate Bayesian computation [liepe2014framework], saCeSS which supports distributed, collaborative optimization [penas2017parameter], and SBSI which supports several distributed optimization methods [adams2013sbsi]. Some of the most popular modeling tools, including COPASI [bergmann2017copasi] and Virtual Cell [resasco2012virtual], also provide model calibration tools. However, none of these tools support multi-algorithmic models. To efficiently calibrate WC models, we should combine numerical optimization methods with additional techniques such as reduced surrogate modeling, distributed computing, and automatic differentiation.

1.4.2.10. Verification tools

Several tools have been developed to verify cell models, including formal verification tools that seek to prove or refute mathematical properties of models and informal verification tools that help modelers organize and evaluate computational tests of models. BioLab [clarke2008statistical] and PRISM [kwiatkowska2011prism] are formal tools for verifying BioNetGen-encoded and SBML-encoded models, respectively. Memote [lieven2017memote] and SciUnit [omar2014collaborative] are unit testing frameworks for organizing computational tests of models. Continuous integration tools such as CircleCI [circleci2017] and Jenkins [jenkins2017] can be used to regularly verify models each time they are modified and pushed to a version control system (VCS) such as Git [git2017].
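
One example of a computational test that such frameworks could organize is a check that every reaction is element-balanced. The sketch below is plain Python and does not use the Memote or SciUnit APIs; the reactions and the function name are hypothetical.

```python
def element_imbalance(reaction, formulas):
    """Net change in each element across a reaction; an empty dict means balanced.

    `reaction` maps species to stoichiometric coefficients (negative for
    substrates); `formulas` maps species to element counts.
    """
    net = {}
    for species, coefficient in reaction.items():
        for element, count in formulas[species].items():
            net[element] = net.get(element, 0) + coefficient * count
    return {element: n for element, n in net.items() if n != 0}


formulas = {
    "glucose": {"C": 6, "H": 12, "O": 6},
    "lactic_acid": {"C": 3, "H": 6, "O": 3},
    "pyruvic_acid": {"C": 3, "H": 4, "O": 3},
}

# Balanced: glucose -> 2 lactic acid.
fermentation = {"glucose": -1, "lactic_acid": 2}
# Deliberately incomplete: glucose -> 2 pyruvic acid omits four hydrogens.
incomplete_glycolysis = {"glucose": -1, "pyruvic_acid": 2}

print(element_imbalance(fermentation, formulas))
print(element_imbalance(incomplete_glycolysis, formulas))
```

Running checks of this kind automatically on every commit, through a continuous integration service, helps catch such errors as soon as they are introduced.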

1.4.2.11. Simulation results formats

HDF5 is an ideal format for storing simulation results [folk2011overview]. In particular, HDF5 supports hierarchical data structures, HDF5 supports compression, HDF5 supports chunking to facilitate fast retrieval of small slices of large datasets, HDF5 can store both simulation results and their metadata, and there are HDF5 libraries available for several languages including C++, Java, MATLAB, Python, and R.

1.4.2.12. Simulation results databases

Several database systems have been developed to organize simulation results for visual and mathematical analysis and disseminate simulation results to the community [vohra2010bookshelf][finocchiaro2003dsmm][van2010dynameomics][meyer2010model][lemson2006halo][riebe2013multidark][wolstencroft2011seek]. We developed WholeCellSimDB, a hybrid relational/HDF5 database, to organize, search, and share WC simulation results [karr2014wholecellsimdb]. WholeCellSimDB uses HDF5 to store simulation results and a relational database to store their metadata. This enables WholeCellSimDB to efficiently store simulation results, quickly search simulations by their metadata, and quickly retrieve slices of simulation results. WholeCellSimDB provides users two interfaces to deposit simulation results; a web-based interface to search, browse, and visualize simulation results; and a JSON web service to retrieve simulation results. However, further work is needed to scale WholeCellSimDB to larger models and to develop tools for quickly searching WholeCellSimDB.

1.4.2.13. Simulation results analysis

Several tools have been developed to analyze and visualize simulation results. The most popular simulation software programs, including COPASI [bergmann2017copasi], E-Cell [dhar2006cell], and Virtual Cell [resasco2012virtual], provide basic tools for visualizing simulation results. Tools such as Escher [king2015escher] and Pathway Tools Omics Viewer [paley2006pathway] can also be used to visualize simulation results.

We developed WholeCellViz to visualize WC simulation results in their biological context [lee2013wholecellviz]. WholeCellViz provides users time series plots and interactive animations to visualize model predictions, and enables users to arrange grids of plots and animations to help users compare predictions across multiple simulation runs and simulated conditions. However, further work is needed to scale WholeCellViz to larger models and to make it easier to incorporate new visualizations into WholeCellViz.

1.4.3. Models of individual pathways and model repositories

Since the 1950’s, researchers have been using the tools described above to model cells. This has led to numerous models that represent individual pathways. Here, we review progress in modeling individual pathways and building repositories of cell models, and we discuss the utility of these models and repositories for WC modeling.

1.4.3.1. Models of individual pathways

Over the past 30 years, researchers have developed a wide range of models of individual cellular pathways [chelliah2015biomodels] (Figure 1.4, Table 1.5). In particular, researchers have developed models of cell cycle regulation [sible2007mathematical]; circadian rhythms [goldbeter2002computational]; electrical signaling [herz2006modeling]; metabolism [swainston2016recon][agren2012reconstruction][uhlen2017pathology]; signaling pathways such as the JAK/STAT, NF-\(\kappa\)B, p53, and TGF\(\beta\) pathways [hughey2010computational]; transcriptional regulation [gerstein2012architecture]; and multicellular processes such as developmental patterning [kondo2010reaction] and infection. However, many pathways have not been modeled at the scale of entire cells, including several well-studied pathways. For example, although we have extensive knowledge of the mutations responsible for cancer, we have few models of DNA repair; although we have extensive structural and catalytic information about RNA modification, we have few kinetic models of RNA modification; and although we have detailed atomistic models of protein folding, we have few cell-scale models of chaperone-mediated folding.

Figure 1.4 WC models can be built by leveraging existing models of well-studied processes (colors) and developing new models of other processes (gray).

Table 1.5 Pathway distribution, computational representations, and taxonomic distribution of the models contained in the BioModels model repository (Chelliah et al., 2015).
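The "Models" columns give the number of models in BioModels by kingdom; the "Mean" columns give the mean model size.

| Pathway | Formalisms | Models: viruses | Models: eukaryotes | Models: bacteria | Models: unannotated | Mean species | Mean reactions | Mean parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cell cycle | ODEs, SSA | | 44 | | | 14.0 | 19.4 | 33.6 |
| Cell death | ODEs | | 11 | | 2 | 24.5 | 33.6 | 42.2 |
| Circadian regulation | ODEs | | 38 | | 1 | 17.3 | 31.2 | 65.5 |
| DNA repair | ODEs | | 1 | | | 23.0 | 25.0 | 26.0 |
| Electrical signaling | ODEs | | 34 | | 5 | 12.7 | 26.4 | 37.5 |
| Gene expression regulation | Boolean network | | 9 | 10 | 5 | 11.9 | 14.0 | 15.5 |
| Host-pathogen interaction | ODEs | 1 | 2 | 1 | | 24.3 | 44.5 | 58.0 |
| Intracellular transport | ODEs | | 2 | | 2 | 7.8 | 12.8 | 16.3 |
| Macromolecule modification | ODEs | | 1 | | 2 | 10.7 | 26.0 | 19.7 |
| Metabolism | FBA, ODEs | | 100 | 16 | 5 | 57.0 | 39.7 | 195.6 |
| Motility | ODEs, PDEs | | 2 | 2 | | 40.8 | 48.3 | 79.5 |
| Organismal process | ODEs | 1 | 66 | 2 | 2 | 17.2 | 20.1 | 48.8 |
| Regulation, other | ODEs | | 5 | | 14 | 12.0 | 17.8 | 22.2 |
| Signal transduction | ODEs, SSA | | 144 | 3 | 30 | 35.3 | 54.1 | 67.8 |
| Stress response | ODEs | | 9 | | | 16.6 | 19.4 | 46.2 |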

Collectively, these models span a broad range of scales. For example, although most of these models represent the chemical transformations responsible for each pathway, some of these models, such as most transcriptional regulation models, use coarser representations. As a second example, although most of these models represent temporal dynamics, most metabolic models only represent the steady-state behavior of metabolism [orth2010flux]. Similarly, although most of these models represent cells as well-mixed bags, some of these models represent the spatial distribution of individual compounds including nutrients and hormones [geitmann2009mechanics][huang2003dynamic][erickson2009modeling]. In addition, although most of these models represent the mean behavior of cells, averaged over multiple cells and cell cycle phases, a few of these models represent the temporal dynamics of the cell cycle and the variation among single cells.

Collectively, these models also use a wide range of computational representations and simulation algorithms. Many of these models are represented as reaction networks. However, some of the largest of these models must be represented using rules [harris2016bionetgen] or Boolean networks. Many of these models can be simulated by integrating ODEs. However, some of the largest models must be simulated using network-free methods [sneddon2011efficient], the steady-state metabolism models must be simulated with FBA [orth2010flux], some of the spatiotemporal models must be simulated by integrating PDEs, and some of the network models must be simulated by iteratively evaluating Boolean regulatory functions [karlebach2008modelling].

These pathway models could be used to help build WC models. However, substantial work would be required to integrate these models into a single model because these models describe different scales, make different assumptions, are represented using different mathematical formalisms, are calibrated to different organisms and conditions, and are represented using different identifiers and formats. To avoid needing to substantially revise pathway models for incorporation into WC models, modelers should build pathway models explicitly for integration into WC models. This requires the modeling community to embrace a common format, common identifiers, common units, and common standards for model calibration and validation.

1.4.4. Models of multiple pathways

Since 1999 when Tomita et al. reported one of the first models of multiple pathways of M. genitalium [tomita1999cell], researchers have been trying to build increasingly comprehensive models of multiple pathways. In particular, this has led to models of Escherichia coli and Saccharomyces cerevisiae which describe their metabolism and transcriptional regulation [covert2004integrating][chandrasekaran2010probabilistic]; their metabolism, signaling, and transcriptional regulation [covert2008integrating][lee2008dynamic][carrera2014integrative]; and their metabolism and RNA and protein synthesis and degradation [thiele2009genome]. Table 1.6 summarizes several recently published and proposed models of multiple pathways. Despite this progress, these models only represent a small number of pathways and a small number of organisms.

Table 1.6 Models of multiple cellular pathways and their computational representations.
Pathways Computational representation Species Status References
28 Chromosome Condensation, Chromosome Segregation, Cytokinesis, DNA damage, DNA repair, DNA supercoiling, FtsZ Polymerization, Host interaction, Macromolecular complexation, Metabolism, Protein activation, Protein decay, Protein folding, Protein modification, Protein processing I, Protein processing II, Protein translocation, Replication, Replication Initiation, Ribosome assembly, RNA decay, RNA modification, RNA processing, Terminal organelle assembly, Transcription, Transcriptional regulation, Translation, tRNA aminoacylation Hybrid: Boolean, flux balance analysis, ordinary differential equations, stochastic simulation Mycoplasma genitalium Published Karr et al., 2012
6 Metabolism, protein complexation, RNA maturation, RNA modification, transcription, translation Flux balance analysis Escherichia coli Published Thiele et al., 2009
5 Metabolism, protein degradation, RNA degradation, transcription, translation Ordinary differential equations Mycoplasma genitalium Published Tomita et al., 1999
3 Circadian rhythms, metabolism, transcriptional regulation Hybrid: flux balance analysis, ordinary differential equations Synechocystis sp. PCC 6803 Proposed Steuer et al., 2012
3 Contraction, electrical signaling, metabolism Ordinary differential equations Homo sapiens Proposed Bassingthwaighte et al., 2005
3 Metabolism, signal transduction, transcriptional regulation Hybrid: Boolean, flux balance analysis, ordinary differential equations Escherichia coli Published Covert et al., 2008
3 Metabolism, signal transduction, transcriptional regulation Hybrid: constraint-based modeling, ordinary differential equations, phenomenological modeling Escherichia coli Published Carrera et al., 2014
3 Metabolism, signal transduction, transcriptional regulation Ordinary differential equations Saccharomyces cerevisiae Published Klipp et al., 2005
3 Metabolism, signal transduction, transcriptional regulation Hybrid: Boolean, flux balance analysis, ordinary differential equations Saccharomyces cerevisiae Published Lee et al., 2008
3 Metabolism, signal transduction, transcriptional regulation Hybrid: Boolean, flux balance analysis, ordinary differential equations N/A Review Gonçalves et al., 2013
2 Cell cycle regulation, metabolism Hybrid: Flux balance analysis, ordinary differential equations Saccharomyces cerevisiae Proposed Barberis et al., 2017
2 Cell cycle regulation, signal transduction Logical model Homo sapiens Published Huard et al., 2012
2 Contraction, electrical signaling Ordinary differential equations Homo sapiens Published Greenstein et al., 2006
2 Metabolism, signal transduction Ordinary differential equations Homo sapiens Published König et al., 2012
2 Metabolism, signal transduction Ordinary differential equations Homo sapiens Published Mosca et al., 2012
2 Metabolism, transcriptional regulation Hybrid: Boolean, flux balance analysis Escherichia coli Published Covert et al., 2004
2 Metabolism, transcriptional regulation Hybrid: Bayesian, flux balance analysis Escherichia coli Published Chandrasekaran and Price, 2010
2 Metabolism, transcriptional regulation Hybrid: Boolean, flux balance analysis Escherichia coli Published Shlomi et al., 2007
2 Electrical signaling, tension development Ordinary differential equations Homo sapiens Published Niederer and Smith, 2007
2 Signal transduction, transcriptional regulation Ordinary differential equations Homo sapiens Published Nakakuki et al., 2010
2 Signal transduction, transcriptional regulation Ordinary differential equations Homo sapiens Published Stelniec-Klotz et al., 2012
2 Metabolism, transcriptional regulation Hybrid: Bayesian, flux balance analysis Mycobacterium tuberculosis Published Chandrasekaran and Price, 2010
2 Metabolism, transcriptional regulation Hybrid: Bayesian, flux balance analysis Mycobacterium tuberculosis Published Ma et al., 2015

To represent multiple pathways, most of these models have been developed by combining separate submodels of each pathway, using the most appropriate mathematical representation for each pathway. This has led to multi-algorithmic models which must be simulated by co-simulating the individual submodels. Because there are few multi-algorithmic simulation tools and most of these models only combine two or three submodels, the developers of most of these models have developed ad hoc methods to simulate their models. For example, Covert et al. developed an ad hoc method to simulate their hybrid dynamic FBA / Boolean model of the metabolism and transcriptional regulation of E. coli [covert2004integrating] and Chandrasekaran and Price developed a different ad hoc method to simulate their hybrid FBA / Bayesian model of the metabolism and transcriptional regulation of E. coli [chandrasekaran2010probabilistic]. Because there are few tools for working with such integrative models, these models have also been described with different ad hoc formats and identifiers, simulated with different ad hoc simulation software programs, and calibrated and validated with different ad hoc methods.

1.4.4.1. Model repositories

Several model repositories, including BioModels [chelliah2015biomodels] and the Physiome Model Repository [yu2011physiome], have been developed to make it easy to find models (Table 1.7). However, only a few of these repositories support integrated models; most of these repositories only support a limited number of model formats; many reported models are never deposited to any model repository; many of the models that are deposited are not sufficiently annotated for other researchers to understand, reuse, and extend the models; and only a few of the repositories also support the information needed to simulate models such as parameter values.

Table 1.7 Repositories that contain published models that can be modified, extended, and combined to create WC models.
Repository Content URL Reference
BiGG Repository for constraint-based models of metabolism http://bigg.ucsd.edu King et al., 2016
BioModels Repository for SBML-encoded models that contains many cell cycle, circadian, electrical signaling, metabolism, and signal transduction models http://www.ebi.ac.uk/biomodels-main Chelliah et al., 2015
FigShare Repository for supplemental materials that contains some models https://figshare.com  
GitHub Repository for code that contains some models https://github.com  
JWS Online Online environment for systems biology modeling that includes a model repository http://jjj.biochem.sun.ac.za Peters et al., 2017
Open Source Brain Repository for NeuroML-encoded models of neurophysiology http://www.opensourcebrain.org Gleeson et al., 2012
Physiome Model Repository Repository for CellML-encoded models that contains physiological models https://models.physiomeproject.org Yu et al., 2011
SimTK Repository for data and code that contains several biomechanics models https://simtk.org  

1.5. Emerging principles and methods for WC modeling

In the previous section, we outlined the ongoing technological advances that are making WC modeling feasible. Here, we propose several principles for WC modeling and describe how we and others are adapting and integrating these technologies into a methodology for WC modeling. In the following sections, we outline the major remaining bottlenecks to WC modeling, highlight ongoing efforts to overcome these bottlenecks, and describe how we are beginning to use this methodology to build WC models.

1.5.1. Principles of WC modeling

Based on our experience, we propose several guiding principles for WC modeling (Figure 1.5).

  • Modular modeling. Similar to other large engineered systems such as software, WC models should be built by partitioning cells into pathways, outlining the interfaces among these pathways, building submodels of each pathway, and combining these submodels into a single model. This approach reduces the dimensionality of model construction, calibration, and validation and facilitates collaborative modeling.
  • Multi-algorithmic simulation. Furthermore, to capture both well- and poorly-characterized pathways, each pathway should be represented using the most appropriate mathematical representation given our knowledge and data about each pathway. In particular, multi-algorithmic simulation should be used to create identifiable models which can be calibrated from our experimental data.
  • Experimental calibration and validation. WC models should be rigorously calibrated and extensively validated via comparison to detailed experimental data across a wide range of molecular mechanisms, phenotypes, and scales.
  • Systemization and standards. To scale modeling to entire cells and facilitate collaboration, we should systemize every aspect of dynamical modeling, develop standards for describing WC models and standard protocols for validating and merging model components, and encourage researchers to embrace these standard protocols and formats.
  • Technology development. To enable WC modeling, we must develop technologies for systematically and scalably building, calibrating, simulating, and validating WC models. These technologies should be modular to facilitate collaborative technology development and integrated into a unified framework to provide modelers user-friendly modeling and simulation tools.
  • Leverage existing methods and data. Where possible, WC modeling should take advantage of existing computational methods and experimental data. For example, WC modeling should take advantage of parallel simulation methods developed by computer science and WC models should be built, in large part, from data aggregated from public repositories.
  • Focus on critical problems and clear, achievable goals. To maximize our efforts, we should periodically identify the key bottlenecks to WC modeling and periodically refocus our efforts on overcoming these bottlenecks. Based on lessons learned from other “big science” projects [hilgartner2013constituting][collins2003human], we should also delineate clear goals and clearly define the responsibilities of each researcher.
  • Focus on model organisms. To facilitate collaboration, early WC modeling efforts should focus on a small number of organisms and cell lines that are easy to culture, well-characterized, karyotypically and phenotypically “normal”, genomically stable, and relevant to a wide range of basic science, medicine, and bioengineering. This includes well-characterized bacteria such as Escherichia coli and well-characterized human cell lines such as the H1 human embryonic stem cell (hESC) line.
  • Reproducibility, transparency, extensibility, and openness. To facilitate collaboration and maximize impact, WC models and simulations should be reproducible, comprehensible, and extensible. For example, to enable other modelers to understand a model, the biological semantic meaning of each species and reaction should be annotated, the data sources and assumptions used to design the model should be annotated, and the parameter values used to produce each simulation result should be recorded. Furthermore, each WC model and WC modeling technology should be free and open-source.
  • Constant innovation. Because we do not yet know exactly what WC models should represent, what WC models should predict, or how to build WC models, we should periodically evaluate the quality of our models and methods and iteratively improve our models and methods as we learn more about cell biology and WC modeling. This should include how we partition cells into pathways, the interfaces that we define among the pathways, and how we simulate multi-algorithmic models.
  • Interdisciplinary collaboration. WC modeling should be an interdisciplinary collaboration among modelers, experimentalists, computer scientists, engineers, and research sponsors. Furthermore, there should be open and frequent communication among the WC modeling community.

Figure 1.5 Principles of WC modeling.

1.5.2. Methods for WC modeling

To enable WC models, we and others are adapting and integrating the technologies described in Section 1.4.2 into a workflow for scalably building, simulating, and validating WC models (Figure 1.6). (1) Modelers will use Datanator to aggregate, standardize, and integrate the experimental data that they will need to build, calibrate, and validate their model into a single dataset. (2) Modelers will use this data to design submodels of each individual pathway using the most appropriate mathematical representation for each pathway, and encode their model in wc_rules, a rule-based format for describing WC models. (3) Modelers will construct reduced models, and use them to calibrate each submodel and their entire model. (4) Modelers will use formal verification and/or unit testing to verify that their model functions as intended and recapitulates the data used to build the model. (5) Modelers will use wc_sim, a scalable, network-free, multi-algorithmic simulator, to simulate their model. (6) Modelers will use WholeCellSimDB to organize their simulation results and use WholeCellViz to visually analyze these results. Importantly, every tool in this workflow will facilitate collaboration to help researchers work together, and these tools will be modular to enable us and others to continuously improve this methodology. We plan to implement this workflow by leveraging recent advances in computational and experimental technology (Section 1.4). Here, we describe the six steps of this emerging workflow.


Figure 1.6 Emerging workflow for scalably building, simulating, and validating WC models. (a) Modelers will aggregate the data for WC modeling into a single dataset. (b) Modelers will use this data to design multi-algorithmic WC models. (c,d) Modelers will use reduced models to calibrate, verify, and validate models. (e) Modelers will simulate multi-algorithmic WC models by co-simulating their submodels. (f) Modelers will visualize and analyze their results to discover new biology, personalize medicine, and design microorganisms.

1.5.2.1. Data aggregation, standardization, and integration

The first step of WC modeling is to aggregate, standardize, integrate, and select the experimental data needed for WC modeling into a single dataset for model building, calibration, and validation (Figure 1.6a).

First, we must aggregate a wide range of experimental data from a wide range of databases, such as biochemical data about metabolite concentrations from ECMDB [sajed2016ecmdb], RNA-seq data about RNA concentrations from ArrayExpress [kolesnikov2015arrayexpress], and mass-spectrometry data about protein concentrations from PaxDb [wang2015version]. Where possible, data should be aggregated using database downloads and web services. Otherwise, data should be aggregated by scraping webpages. In addition to aggregating data from databases, we should also aggregate data from collaborators, individual publications, and bioinformatics prediction tools such as PSORTb [yu2010psortb] and TargetScan [agarwal2015predicting].

To the extent possible, we should record the provenance of this data, including the biosample (e.g., species, strain, genetic variants) and environmental conditions (e.g., temperature, pH, growth media) that were measured, the experimental method used to generate the data, the computational method used to analyze the data, and the citation for the original data. This provenance will help us select the most relevant data for modeling and trace models back to their data sources.

Second, we must standardize the identifiers and units used to describe this data. For example, metabolites should be identified using the IUPAC International Chemical Identifier (InChI) format [heller2013inchi] and RNA should be identified by their genomic coordinates. Similarly, all units should be standardized to SI units or combinations of SI units.

Third, we must integrate this data by linking the data together through common metabolites, chromosomes, RNA, proteins, and interactions. To enable this data to be quickly searched and explored, this data should be organized into a relational database.

Fourth, we must identify the most relevant data within our database for the species and environmental condition that we want to model. For each experimental measurement that we need to constrain a model, we must search our database for data observed for similar biology (e.g., metabolites, RNA, proteins, and interactions), genotypes (e.g., species, strain, and genetic variants), and environmental conditions (e.g., temperature, pH, growth media); calculate the relevance of each experimental observation; and calculate the consensus of the relevant observations, weighted by their relevance.

Fifth, we should organize these consensus experimental values and their provenance (experimental evidence and the method used to calculate the consensus value) into a single dataset. Pathway/genome databases (PGDB) can be used to organize this information because PGDBs are well-suited to representing relationships among experimental data about a single species. We have developed the WholeCellKB PGDB to organize the data needed for WC modeling. WholeCellKB provides users three interfaces to deposit experimental data for WC models, extensive functionality for validating this data, a web-based user interface to search and browse this data, and a JSON web service to programmatically retrieve data for model construction.
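The relevance-weighted consensus calculation described in the fourth step above can be sketched as follows. This is a minimal illustration rather than the method implemented by any of the tools named here: the `Observation` type, the similarity scores, and the multiplicative relevance rule are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """One experimental measurement with hypothetical similarity scores in [0, 1]."""
    value: float                 # measured value, already in standardized units
    taxon_similarity: float      # 1.0 = same species and strain as the model
    condition_similarity: float  # 1.0 = same temperature, pH, and growth media


def relevance(obs: Observation) -> float:
    # Assumed rule: relevance is the product of the genotype and
    # environment similarities, so a mismatch in either one lowers it.
    return obs.taxon_similarity * obs.condition_similarity


def weighted_consensus(observations: List[Observation]) -> float:
    """Return the mean of the observed values, weighted by relevance."""
    weights = [relevance(o) for o in observations]
    total = sum(weights)
    if total == 0:
        raise ValueError("no relevant observations")
    return sum(w * o.value for w, o in zip(weights, observations)) / total


# Hypothetical observations of the same quantity from three sources
observations = [
    Observation(value=10.0, taxon_similarity=1.0, condition_similarity=1.0),
    Observation(value=20.0, taxon_similarity=0.5, condition_similarity=1.0),
    Observation(value=40.0, taxon_similarity=0.5, condition_similarity=0.0),
]
consensus = weighted_consensus(observations)
```

In this sketch, the third observation is ignored because it was measured under an unrelated condition, and the second contributes half as much as the first. In practice, the relevance function would need to be chosen and justified for each type of data.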

1.5.2.2. Model design

The second step of WC modeling is to use the data aggregated in the first step to design models, including each species and interaction (Figure 1.6b). To represent the details of well-characterized pathways, as well as coarsely represent poorly-characterized pathways, WC models should be built by partitioning cells into pathways, modeling each pathway using the most appropriate mathematical representation, and combining pathway submodels into composite, multi-algorithmic models.

To capture the large number of possible cellular phenotypes, WC models should also capture the combinatorial complexity of cellular biochemistry. For example, WC models should represent the combinatorial number of RNA transcripts that can be produced from the interactions of transcription, RNA editing, RNA folding, and RNA degradation; the combinatorial number of possible interactions among the subunits of protein complexes; and the combinatorial number of phosphorylation states of each protein complex.
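To give a sense of the scale involved, the short sketch below counts the phosphorylation states of a single complex. It assumes that each site is independent and binary (phosphorylated or not), which is a simplification; the function names are hypothetical.

```python
from itertools import product


def count_states(n_sites: int) -> int:
    """Number of distinct states of a complex with n independent binary sites."""
    return 2 ** n_sites


def enumerate_states(n_sites: int):
    """Explicitly list every state; only feasible for small n_sites."""
    return list(product((False, True), repeat=n_sites))


small = enumerate_states(3)   # eight states, easy to list explicitly
large = count_states(30)      # over a billion states, impractical to list
```

Because the number of states doubles with each additional site, explicitly enumerating every species quickly becomes infeasible, which motivates the rule-based descriptions discussed later in this section.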

To generate accurate predictions, WC models should also aim to represent the aggregate physiology of poorly understood biology such as uncharacterized genes, uncharacterized small peptides, and uncharacterized non-coding RNA. This can be accomplished by including lumped reactions that represent the aggregate physiology of all unknown biology. For example, to accurately predict metabolic reaction fluxes, like FBA models, WC models can include reactions that capture the aggregate energy usage of all uncharacterized interactions.

To scalably and reproducibly build WC models, WC models should be programmatically built from PGDBs using scripting tools such as PySB [lopez2013programming].

Because WC models will never be complete, WC models should be built by designing an initial model and then iteratively improving the model until the model accurately predicts new experimental measurements. In particular, WC models can be systematically refined by identifying gaps between their bottom-up descriptions of cellular biochemistry and our physiological knowledge, searching for reactions and gene products that might fill those gaps, and parsimoniously adding species and reactions to models so they recapitulate experimental observations. Model selection methods can also be used to select among multiple potential model designs. Furthermore, version control systems such as Git [git2017] should be used to track model changes and enable collaborators to refine models in parallel and merge their refined models.

To enable other researchers to reproduce, understand, reuse, and extend WC models, WC models should be encoded in rule-based formats such as BioNetGen and extensively annotated. In particular, rule-based formats enable researchers to concisely describe the combinatorial complexity of cell biology. Model annotations should include semantic annotations about the biological meaning of each species and interaction such as the chemical structure of each metabolite in InChI format [heller2013inchi] and provenance annotations about the data sources, assumptions, and design decisions behind each species, interaction, and pathway.

1.5.2.3. Model calibration

The third step in WC modeling is to calibrate model parameters (Figure 1.6c). This should be done by using numerical optimization methods to minimize the distance between the model’s predictions and related experimental observations. One promising method for calibrating composite WC models is to (a) use multi-algorithmic modeling to only create parameters whose values can be constrained by one or a small number of experimental measurements, (b) estimate the value of each individual parameter using one or a small number of experimental observations, (c) construct a set of reduced models, one for each submodel, to estimate the joint values of the parameters, and (d) use distributed global optimization tools such as saCeSS [penas2017parameter] to refine the joint values of the parameters [karr2015summary]. This method avoids the need to jointly calibrate large numbers of parameters against physiological data; performs the majority of model calibration using low dimensional models of individual species, reactions, and pathways; and generates successively better starting points for more refined calibration.
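The staged strategy above can be illustrated with a toy reduced model of a single species. In this sketch, each parameter is first estimated from one observation and the estimates are then refined jointly. The model, the observed values, and the simple local pattern search are all assumptions made for illustration; the pattern search stands in for, and is much simpler than, a distributed global optimizer such as saCeSS.

```python
import math


def loss(params, observed_half_life, observed_copy_number):
    """Sum of squared relative errors for dx/dt = k_syn - k_deg * x."""
    k_syn, k_deg = params
    predicted_half_life = math.log(2) / k_deg
    predicted_copy_number = k_syn / k_deg  # steady state
    return (((predicted_half_life - observed_half_life) / observed_half_life) ** 2
            + ((predicted_copy_number - observed_copy_number) / observed_copy_number) ** 2)


def refine(params, loss_fn, step=0.5, tol=1e-6):
    """Local pattern search: scale one parameter at a time and keep improvements."""
    params = list(params)
    best = loss_fn(params)
    while step > tol:
        improved = False
        for i in range(len(params)):
            for factor in (1 + step, 1 / (1 + step)):
                trial = params.copy()
                trial[i] *= factor
                trial_loss = loss_fn(trial)
                if trial_loss < best:
                    params, best, improved = trial, trial_loss, True
        if not improved:
            step /= 2  # shrink the search once no single move helps
    return params, best


# Hypothetical observations for one RNA species
observed_half_life = 300.0    # s
observed_copy_number = 50.0   # molecules per cell

# Step (b): estimate each parameter from a single observation
k_deg0 = math.log(2) / observed_half_life
k_syn0 = observed_copy_number * k_deg0

# Step (d): refine the joint values, here from a deliberately perturbed start
start = [1.3 * k_syn0, 0.8 * k_deg0]
objective = lambda p: loss(p, observed_half_life, observed_copy_number)
fitted, final_loss = refine(start, objective)
```

The individual estimates give the joint refinement a starting point close to the optimum, which is the main benefit of the staged approach when the full model is expensive to simulate.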

1.5.2.4. Model verification and validation

The fourth step in WC modeling is to verify that models behave as intended and validate that models recapitulate the true biology (Figure 1.6d). First, WC models should be verified using a series of increasingly comprehensive unit tests that test each individual species, reaction, and pathway, as well as groups of pathways and entire models. Importantly, these tests should cover all of the logic of the model. For example, these tests should test the edge cases of every rate law. Reduced models should be used to efficiently test individual species, reactions, and pathway submodels. Furthermore, to quickly identify errors, continuous integration systems such as Jenkins [jenkins2017] should be used to automatically execute tests each time models are revised. Alternatively, models can be verified using formal verification systems such as PRISM [kwiatkowska2011prism]. However, substantial work remains to adapt formal verification to multi-algorithmic dynamical modeling.

Second, WC models should be validated by comparing their simulation results to independent experimental data that was not used for model construction or calibration. To be effective, models should be tested using a broad range of data that spans different types of predictions, genetic perturbations, and environmental conditions.

Third, because it is infeasible to validate every possible model prediction, modelers should annotate how models were validated to help other modelers know which model predictions can be trusted, know which predictions still need to be validated, and reuse the validation data to validate improved and/or extended models. These annotations should include which data were used for validation, which predictions were validated, and how well the model recapitulated each experimental observation. We believe that this metadata will be critical for medicine where therapy should only be driven by validated model predictions.
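As a small illustration of the unit-testing approach to verification described in this section, the sketch below tests the edge cases of one rate law. The Michaelis-Menten rate law and the test names are chosen only for illustration; a real WC model would need analogous tests for every rate law, reaction, and pathway.

```python
import unittest


def michaelis_menten(substrate: float, v_max: float, k_m: float) -> float:
    """Michaelis-Menten rate law: v = v_max * S / (K_m + S)."""
    if substrate < 0 or v_max < 0 or k_m <= 0:
        raise ValueError("S and v_max must be non-negative and K_m must be positive")
    return v_max * substrate / (k_m + substrate)


class MichaelisMentenTestCase(unittest.TestCase):
    def test_zero_substrate_gives_zero_rate(self):
        self.assertEqual(michaelis_menten(0.0, v_max=10.0, k_m=2.0), 0.0)

    def test_half_maximal_rate_at_km(self):
        self.assertAlmostEqual(michaelis_menten(2.0, v_max=10.0, k_m=2.0), 5.0)

    def test_rate_saturates_at_v_max(self):
        self.assertAlmostEqual(michaelis_menten(1e12, v_max=10.0, k_m=2.0), 10.0, places=6)

    def test_negative_substrate_is_rejected(self):
        with self.assertRaises(ValueError):
            michaelis_menten(-1.0, v_max=10.0, k_m=2.0)


# Run the tests programmatically, as a continuous integration job might
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MichaelisMentenTestCase)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A continuous integration system would run a suite like this each time the model is revised and report any failing test.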

1.5.2.5. Network-free multi-algorithmic simulation

The fifth step of WC modeling is to numerically simulate WC models (Figure 1.6e). Because WC models should be described using rules and composed of multiple mathematically-dissimilar submodels, WC models should be simulated by co-simulating their submodels. This can be achieved in three steps. First, all of the submodels should be converted to explicit time-driven submodels. For example, Boolean submodels should be converted to SSA submodels by assuming typical concentrations and kinetic rates. Second, all of the mathematically-similar submodels should be analytically merged into a single mathematically-equivalent submodel. Third, for WC models that are composed only of FBA, ODE, and SSA submodels, (a) the SSA submodel should be used as the master clock for the integration and synchronization of the submodels, (b) each time the SSA submodel advances to the next iteration, the FBA and ODE submodels should be synchronized with the SSA submodel and integrated for the same timestep as the SSA submodel, and (c) the SSA submodel should be synchronized with the FBA and ODE submodels. If the FBA or ODE submodels generate unphysical states such as negative concentrations, they must be rolled back and reintegrated for multiple smaller timesteps. To efficiently simulate WC models, the FBA and ODE submodels should only be evaluated periodically.
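The third step above can be sketched for the simplest case of one SSA submodel and one ODE submodel that share a state. This is a toy illustration under several assumptions: the two submodels, their rate constants, and the forward-Euler integrator are hypothetical, the ODE submodel is integrated at every SSA step rather than periodically, and FBA is omitted.

```python
import random


def ssa_reactions(state):
    """Hypothetical SSA submodel: returns (propensity, state change) pairs."""
    return [
        (0.5, {"rna": +1}),                   # RNA synthesis
        (0.01 * state["rna"], {"rna": -1}),   # first-order RNA degradation
    ]


def ode_rates(state):
    """Hypothetical ODE submodel: mass increases in proportion to RNA."""
    return {"mass": 1e-3 * state["rna"]}


def integrate_ode(state, dt, n_substeps=1):
    """Forward Euler; on an unphysical state, roll back and use smaller steps."""
    trial = dict(state)
    h = dt / n_substeps
    for _ in range(n_substeps):
        for name, rate in ode_rates(trial).items():
            trial[name] += rate * h
    if min(trial.values()) < 0 and n_substeps < 1024:
        return integrate_ode(state, dt, 2 * n_substeps)
    return trial


def co_simulate(initial_state, t_end, seed=0):
    rng = random.Random(seed)
    state, t = dict(initial_state), 0.0
    while t < t_end:
        reactions = ssa_reactions(state)
        total = sum(propensity for propensity, _ in reactions)
        # The SSA submodel sets the master clock: time to its next event
        dt = rng.expovariate(total) if total > 0 else float("inf")
        if t + dt >= t_end:
            return integrate_ode(state, t_end - t), t_end
        # Integrate the ODE submodel over the same interval
        state = integrate_ode(state, dt)
        # Fire one SSA reaction, chosen in proportion to its propensity
        threshold, cumulative = rng.random() * total, 0.0
        for propensity, change in reactions:
            cumulative += propensity
            if threshold < cumulative:
                for name, delta in change.items():
                    state[name] += delta
                break
        t += dt
    return state, t


final_state, final_time = co_simulate({"rna": 0, "mass": 1.0}, t_end=100.0, seed=1)
```

Because both submodels read and write the same state dictionary, synchronization is implicit in this sketch. A scalable simulator would need explicit interfaces between submodels and a policy for how often the deterministic submodels are evaluated.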

To efficiently simulate the combinatorial complexity represented by WC models, most submodels should be simulated using SSA and SSA should be implemented using network-free graph-based methods. Specifically, SSA should be implemented by representing each molecule as a graph, representing each reaction rule as a graph, searching for matching pairs of species-reaction graphs to determine the rate of each reaction, randomly selecting a reaction to fire, updating the species involved in the selected reaction, and using a species-reaction dependency graph to update the rates of all affected reactions. This methodology will enable WC simulations to scale to large numbers of possible species and reactions by only representing the configuration of each active molecule rather than representing the copy number of each possible species.
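The idea of tracking individual molecules rather than species counts can be illustrated with a deliberately simplified sketch. Here each molecule is a list of binary sites rather than a general graph, there is a single rule ("phosphorylate any free site"), and the list of rule matches is updated directly instead of through a dependency graph. All names and rate constants are hypothetical.

```python
import random


class Protein:
    """One molecule instance; its state lives on the molecule itself."""

    def __init__(self, n_sites):
        self.sites = [False] * n_sites  # False = unphosphorylated


def network_free_ssa(n_molecules, n_sites, k_phos, t_end, seed=0):
    """Simulate one phosphorylation rule without enumerating 2**n_sites species."""
    rng = random.Random(seed)
    molecules = [Protein(n_sites) for _ in range(n_molecules)]
    # Every (molecule, site) pair that currently matches the rule's pattern
    matches = [(m, s) for m in range(n_molecules) for s in range(n_sites)]
    t = 0.0
    while matches:
        # Rule propensity = rate constant x number of current matches
        t += rng.expovariate(k_phos * len(matches))
        if t > t_end:
            break
        # Pick one match uniformly, remove it in O(1), and apply the rule
        i = rng.randrange(len(matches))
        matches[i], matches[-1] = matches[-1], matches[i]
        m, s = matches.pop()
        molecules[m].sites[s] = True
    return molecules


molecules = network_free_ssa(n_molecules=200, n_sites=20, k_phos=0.01, t_end=10.0, seed=0)
observed_states = {tuple(p.sites) for p in molecules}
```

With 20 sites there are over a million possible species, but the simulation only ever stores the 200 molecules that exist, so its memory use scales with the number of molecules rather than the number of possible species.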

To simulate WC models quickly, WC models should be simulated using a distributed simulation framework such as parallel discrete event simulation (PDES) and by partitioning WC models into cliques of tightly connected species and reactions.

To make WC simulations comprehensible and reproducible, WC simulations should be represented using a common format such as SED-ML or SESSL.

1.5.2.6. Visualization and analysis of simulation results

The sixth step of WC modeling is to visualize and analyze WC simulation results to discover new biology, personalize medicine, or design microbial genomes (Figure 1.6f). First, all of the metadata needed to understand and reproduce simulation results should be recorded, including the model, the version of the model, the parameter values, and the random number generator seed that were used. Second, simulation results should be logged and stored in HDF5 format [folk2011overview]. Third, WC simulation results and their metadata should be organized using a tool such as WholeCellSimDB that helps researchers search, slice, reduce, and share simulation results. Fourth, researchers should use tools such as WholeCellViz to visually analyze WC simulation results and use visualization grammars such as Vega [satyanarayan2017vega] to develop custom diagrams.
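The first point above, recording the metadata needed to reproduce a result, can be sketched as follows. The stand-in simulation, the field names, and the use of JSON are assumptions made for illustration; this sketch does not reflect the schema of WholeCellSimDB, and it omits storage of the results themselves.

```python
import json
import random


def simulate(parameters, seed, n_steps=5):
    """Stand-in for a WC simulation: a seeded random walk."""
    rng = random.Random(seed)
    x, trajectory = parameters["x0"], []
    for _ in range(n_steps):
        x += rng.gauss(0.0, parameters["sigma"])
        trajectory.append(x)
    return trajectory


def run_and_record(model_id, model_version, parameters, seed):
    """Run a simulation and return its results with a metadata record."""
    results = simulate(parameters, seed)
    metadata = {
        "model": model_id,
        "model_version": model_version,
        "parameters": parameters,
        "random_seed": seed,
    }
    return results, json.dumps(metadata, sort_keys=True)


results, record = run_and_record(
    "toy_model", "0.1.0", {"x0": 1.0, "sigma": 0.1}, seed=42)

# Later, the stored record is enough to reproduce the same trajectory
saved = json.loads(record)
reproduced = simulate(saved["parameters"], saved["random_seed"])
```

Recording the seed alongside the parameters is what makes a stochastic simulation exactly repeatable, provided that the same model version and simulator are used.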

1.6. Latest WC models and their limitations

Because it is not yet possible to completely model a cell, researchers are pursuing several complementary approaches to modeling entire cells. Historically, researchers such as Michael Shuler focused on building coarse-grained models of the major functions of cells [atlas2008incorporating][shuler1979mathematical]. Over the last ten years, researchers have begun to leverage the growing wealth of experimental data and our increasing computational power to build fine-grained models of the molecular biology of entire cells. This includes bottom-up efforts to represent the contribution of each gene to cellular behavior starting from genome sequences and annotations [karr2012whole], top-down efforts to represent the integrated behavior of each cellular process, and bottom-up efforts to model diffusion at the cell scale [roberts2014cellular][hallock2014simulation][roberts2009long]. More recently, researchers have begun to merge these fine-grained approaches. For example, Luthey-Schulten and her colleagues recently demonstrated a hybrid FBA-diffusion model of E. coli [cole2015spatially]. Here, we describe recent progress in each of these major approaches to WC modeling.

1.6.1. Coarse-grained models

In addition to fine-grained models, researchers have also developed several coarse-grained models of multiple cellular processes [atlas2008incorporating][shuler1979mathematical]. These models could be used to help inform the global structure and mathematical behavior of WC models. However, they generally cannot be directly incorporated into WC models because they use coarse-grained representations that are incompatible with those of fine-grained WC models.

1.6.2. Genomically-centric bottom-up fine-grained models

As a step toward WC models, we and others recently demonstrated the first model that represents every characterized gene function of a cell [karr2012whole] (Figure 1.7a). The model represents 28 pathways of M. genitalium. The model was developed by annotating the M. genitalium genome, reconstructing the species encoded by each gene and the reactions catalyzed by each gene using data from over 900 databases and publications, partitioning the species and reactions into 28 pathways, developing separate submodels of each pathway, and integrating the submodels into a single model. To help us organize the data used to build the model, we developed WholeCellKB, a pathway/genome database (PGDB) software system tailored for WC modeling [karr2013wholecellkb], and developed scripts to generate the model from the PGDB.


Figure 1.7 A WC model of M. genitalium predicts high-level cellular behaviors from the molecular level. (a) The model combines multiple submodels of individual cellular subsystems. We validated the model by comparing its outputs to experimental data which describes its rate of growth (b) and RNA polymerase occupancy (c). We have used the model to understand how cells regulate their cell cycle (d) and allocate energy (e).

To capture our varying level of knowledge about each pathway, we described each pathway using the most appropriate mathematical representation. For example, we represented transcription and translation as stochastic models, represented metabolism using FBA, and represented cell division with ODEs. We combined the submodels into a single model by mapping their inputs and outputs onto a common set of global variables that we formed by taking the union of the state variables of the individual submodels.

We developed a novel algorithm to simulate the combined model by co-simulating the submodels. The algorithm co-simulated the submodels by partitioning the copy number variables into separate pools for each submodel proportional to their anticipated consumption, iteratively integrating the submodels, updating the global variables by merging the pools associated with the submodels, and updating all other state variables. To help us analyze the model’s simulation results, we also developed WholeCellSimDB, a database for organizing, storing, and sharing WC simulation results [karr2014wholecellsimdb], and WholeCellViz, a web-based software tool for visualizing high-dimensional WC simulation results in their biological context [lee2013wholecellviz].
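The partitioning step of this algorithm can be sketched for a single shared species. The sketch below is not the published implementation: the function name, the submodel names, the demand values, and the use of largest-remainder rounding to keep the pools integer-valued are all assumptions made for illustration.

```python
def partition_pool(available: int, demands: dict) -> dict:
    """Split an integer copy number among submodels in proportion to demand.

    Largest-remainder rounding keeps every pool an integer while ensuring
    that the pools sum to `available` whenever any demand is positive.
    """
    total = sum(demands.values())
    if total == 0:
        return {name: 0 for name in demands}
    shares = {name: available * d / total for name, d in demands.items()}
    pools = {name: int(share) for name, share in shares.items()}
    leftover = available - sum(pools.values())
    # Give the molecules lost to rounding to the largest fractional shares
    by_remainder = sorted(demands, key=lambda n: shares[n] - pools[n], reverse=True)
    for name in by_remainder[:leftover]:
        pools[name] += 1
    return pools


# Hypothetical anticipated ATP consumption of three submodels in one timestep
atp_pools = partition_pool(
    100, {"transcription": 30, "translation": 60, "replication": 10})
even_pools = partition_pool(10, {"a": 1, "b": 1, "c": 1})
```

After each submodel has been integrated against its own pool, the unused molecules in the pools would be merged back into the global copy number before the next timestep.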

We calibrated the model by constructing a set of reduced models that focused on each pathway submodel, calibrating the individual submodels, and using the parameter values learned from calibrating the individual submodels as a starting point for calibrating the entire model [karr2015summary].

We validated the model by constructing numerous reduced models that focused on individual submodels and groups of submodels, checking that the submodels and groups of submodels are consistent with our knowledge such as the Central Dogma, and checking that the submodels and groups of submodels are consistent with the experimental data that we used to build the model and additional independent experimental data (Figure 1.7b,c). In particular, we demonstrated that the model recapitulates the observed M. genitalium growth rate and predicts the essentiality of each gene with 80% accuracy.

In addition, we have used the model to demonstrate how WC models could be used to help design synthetic circuits [purcell2013towards] and how WC models could help reposition antibiotics among distant bacteria [kazakiewicz2015combined].

Despite this progress, the model does not represent several important cell functions such as the maintenance of electrochemical gradients across the cell membrane, and the model mispredicts several important phenotypes such as the growth rates of many single-gene deletion strains. Furthermore, the model took over 10 person-years to construct because it was largely built by hand; the model is difficult to understand, reuse, and extend because it was described directly in terms of its numerical simulation rather than using a high-level format such as SBML; the model’s simulation software is not reusable because it was built to simulate a single model; and the model’s simulation algorithm violates the arrow of time and is unscalable because it only partitions a portion of the state variables among the submodels.

1.6.3. Physiologically-centric top-down fine-grained models

In parallel, researchers such as Edda Klipp are taking a top-down, physiologically-centric approach to WC modeling that complements our genomically-centric bottom-up approach. In contrast to our approach, which starts from annotated genomes, Edda Klipp and her colleagues are modeling entire cells by enumerating the major processes present in cells, developing submodels of each process, and combining the submodels into a single model.

1.6.4. Spatially-centric bottom-up fine-grained models

In parallel, researchers such as Elijah Roberts and Zaida Luthey-Schulten are taking another complementary spatially-centric approach to WC modeling [roberts2014cellular][hallock2014simulation][roberts2009long]. This approach focuses on representing the spatial distribution and diffusion of each molecular species, and uses molecular dynamics simulation methods to predict their spatiotemporal dynamics. However, because it is computationally expensive to simulate diffusion on the scale of entire cells, this approach is currently limited to second-scale simulations.

1.6.5. Hybrid models

As introduced above, Zaida Luthey-Schulten and her colleagues have begun to merge these fine-grained approaches to WC modeling by combining a diffusion model with an FBA model [cole2015spatially].

1.7. Bottlenecks to more comprehensive and predictive WC models

In the previous sections, we described how we and others are beginning to build WC models. Despite this progress, it is still challenging to build and simulate WC models. To help focus the community’s efforts to accelerate WC modeling, here, we summarize the major remaining bottlenecks to WC modeling (Figure 1.8). These bottlenecks are based on our own experience and a community survey of the bottlenecks to biomodeling that we conducted in 2017 [szigeti2018blueprint]. In the following sections, we suggest ways to overcome these bottlenecks.


Figure 1.8 Major bottlenecks to WC modeling and the major methods, tools, and resources needed to advance WC modeling.

1.7.1. Inadequate experimental methods and data repositories

In our opinion, one of the biggest bottlenecks to WC modeling is collecting and aggregating enough high-quality experimental data to build WC models. This is a significant bottleneck because WC models require extensive data, and because, as described in Section 1.3.1, we do not yet have sufficient methods for characterizing cells, sufficient tools for annotating the semantic meaning of experimental data, sufficient repositories for aggregating and integrating experimental data, and sufficient incentives for researchers to share their data.

New measurement methods, data repositories, and data aggregation tools are needed to overcome this bottleneck: (a) improved proteome-wide methods for measuring protein abundances would facilitate more accurate models of many pathways; (b) improved metabolome-wide methods for measuring metabolite concentrations would enable more accurate models of metabolism; (c) new single-cell measurement methods would facilitate more accurate models of the phenotypic variation of single cells; (d) a new central data repository that uses consistent representations, identifiers, and units would accelerate data aggregation [howe2008big]; and (e) new tools for searching this repository would help researchers identify relevant data for WC modeling, including data from related organisms and environments.

1.7.2. Incomplete, inconsistent, scattered, and poorly annotated pathway models

As discussed in Section 1.5, the most promising strategy for building WC models is to combine multiple separate models. However, the lack of a complete set of compatible, well-annotated, and high-quality pathway models is a major bottleneck to WC modeling [krause2009annotation][neal2014reappraisal][snoep2006towards][gonccalves2013bridging]. Here, we summarize the limitations of our pathway models.

1.7.2.1. Incomplete models

Despite decades of modeling research and detailed models of several pathways, we still do not have models of most pathways. For example, we do not have models of the numerous DNA repair mechanisms, the mechanisms responsible for RNA editing, or the role of chaperones in protein folding.

1.7.2.2. Poorly validated and unreliable models

Many of our existing pathway models are too poorly validated and too unreliable to be effective components of WC models. Furthermore, few models are published with enough information about what data was used to validate the model, which simulation predictions were validated, and which simulation predictions are reliable for other researchers to know the limitations of a model and how to properly reuse it.

1.7.2.3. Inconsistent models

Furthermore, many of our existing pathway models are inconsistent. In particular, many existing models are described with different assumptions, granularities, mathematical representations, identifiers, units, and formats.

1.7.2.4. Unpublished and scattered models

Unfortunately, our published models are scattered across a large number of resources, including model repositories such as BioModels and SimTK, supplementary materials, GitHub, and individual lab web pages, and many reported models are never published.

1.7.2.5. Incompletely annotated models

Many reported models are also not sufficiently well-annotated to combine them into WC models. For example, the biological semantic meaning of a model is often not annotated. This makes it difficult for other researchers to understand the meaning of each variable and equation which, in turn, makes it difficult for other researchers to merge models. The provenance of a model is also rarely annotated. This makes it difficult for other researchers to understand how a model was calibrated, recalibrate the model to represent a different organism and/or condition, and merge a model with models of other organisms and/or conditions. In addition, the assumptions of a model are also rarely annotated. Similarly, this makes it difficult for other researchers to understand how a model was developed, revise a model to represent other organisms and conditions, and merge models from different organisms and conditions.

1.7.3. Inadequate software tools for WC modeling

As described in Section 1.4, a wide range of tools have been developed for modeling individual pathways. However, few of these tools support all of the features needed for WC modeling. In particular, few of these tools support the scale required for WC modeling, few of these tools support composite, multi-algorithmic modeling, few of these tools support collaboration, and these tools do not support all of the metadata needed to understand models and their provenance.

1.7.4. Inadequate model formats

As described in Section 1.4.2.5, several formats have been developed to describe cell models. However, the lack of a format that supports all of the features needed for WC modeling is a major bottleneck. In particular, no existing format can represent (a) the combinatorial complexity of pathways such as transcription elongation which involve billions of sequence-based reactions; (b) the multiple scales that must be represented by WC models such as the sequence of each protein, the subunit composition of each complex, and the DNA binding of each complex; and (c) multi-algorithmic models that are composed of multiple mathematically-distinct submodels [waltemath2016toward].

1.7.5. Lack of coordination among the cell modeling community

Another major bottleneck to WC modeling is the lack of coordination among the cell modeling community. Currently, the lack of coordination leads modelers to build competing models of the same pathways and describe models with inconsistent identifiers and formats.

1.8. Technologies needed to advance WC modeling

In the previous section, we outlined the major remaining bottlenecks to WC modeling. To overcome these bottlenecks, we must develop a wide range of computational and experimental technologies. Here, we describe the most critically needed technologies to advance WC modeling. In the following sections, we highlight our and others’ ongoing efforts to develop these technologies.

1.8.1. Experimental methods for characterizing cells

While substantial data about cellular populations already exists, additional data would enable better WC models. In particular, we should develop new experimental methods for quantitating the dynamics and single-cell variation of each metabolite and protein. Additionally, we should develop methods for measuring kinetic parameters at the interactome scale, as well as methods for measuring cellular phenotypes across multiple genetic and environmental conditions.

1.8.2. Tools for aggregating, standardizing, and integrating heterogeneous data

As described in Section 1.4.1.1-1.4.1.2, extensive data is now available for WC modeling. However, this data spans a wide range of data types, organisms, and environments; it is often not annotated or normalized; it is scattered across many repositories and publications; and it is described using inconsistent identifiers and units. To make this data more usable for modeling, we must develop tools for aggregating data from multiple sources; merging data from multiple specimens, environmental conditions, and experimental procedures; standardizing data to common identifiers and units; identifying the most relevant data for a model; and averaging across multiple imprecise and noisy observations.
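
As a minimal sketch of what standardizing data to common identifiers and units could look like, the following Python example maps source-specific species names to one identifier and converts concentrations to molar. The synonym table, the conversion table, and the Observation class are hypothetical and exist only for illustration; a real tool would draw identifiers from curated resources rather than a hand-written dictionary.

```python
from dataclasses import dataclass

# Hypothetical conversion factors to molar (M).
TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}

# Hypothetical synonym table mapping source-specific names to one identifier.
SYNONYMS = {"atp": "ATP", "adenosine triphosphate": "ATP"}


@dataclass(frozen=True)
class Observation:
    species: str  # identifier as reported by the source
    value: float  # numeric value as reported
    units: str    # units as reported
    source: str   # provenance: where the observation came from


def standardize(obs: Observation) -> Observation:
    """Return a copy with a canonical identifier and the value in molar."""
    canonical = SYNONYMS.get(obs.species.strip().lower())
    if canonical is None:
        raise KeyError(f"unknown species identifier: {obs.species!r}")
    if obs.units not in TO_MOLAR:
        raise KeyError(f"unknown units: {obs.units!r}")
    # The provenance field is carried through unchanged.
    return Observation(canonical, obs.value * TO_MOLAR[obs.units], "M", obs.source)


raw = [
    Observation("atp", 3.0, "mM", "source A"),
    Observation("Adenosine triphosphate", 2500.0, "uM", "source B"),
]
standardized = [standardize(o) for o in raw]
```

After standardization, both observations refer to the same species in the same units, so they can be compared or averaged directly. Unknown names and units raise an error rather than passing through silently, which is one possible design choice for surfacing gaps in the mapping tables.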

1.8.3. Tools for scalably designing models from large datasets

To scalably build WC models, we must develop tools for defining the interfaces among pathway submodels, collaboratively designing composite, multi-algorithmic models directly from large datasets, automatically identifying inconsistencies and gaps in dynamical models, recording how data and assumptions are used to build models, and encoding models in a rule-based format. As described in Section 1.4.2.2-1.4.2.4, several tools support each of these features. To accelerate WC modeling, we should develop a single tool that supports all of these functions at the scale required for WC modeling.

1.8.4. Rule-based format for representing models

Several formats can represent individual biological processes. However, no existing format is well-suited to representing the scale or mathematical diversity required for WC modeling [waltemath2016toward][medley2016guidelines]. To succinctly represent WC models, we should develop a rule-based format that can (a) represent models in terms of high-level biological constructs such as DNA, RNA, and proteins; (b) represent each molecular species at multiple levels of granularity (for example, as a single species, as a set of sites, and as a sequence); (c) represent all of the combinatorial complexity of molecular biology including the complexity of interactions among protein sites, as well as the complexity of protein-metabolite, protein-DNA, and protein-RNA interactions and the complexity of template-based polymerization reactions such as the combinatorial number of RNAs that arise from the interaction of RNA splicing, editing, and mutations; (d) represent composite, multi-algorithmic models; (e) represent the biological semantic meaning of each species and interaction using database-independent formats such as InChI [heller2013inchi] and DNA, RNA, and protein sequences; and (f) represent model provenance including the data and assumptions used to build models.
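
To illustrate why a rule-based format is more succinct than an explicit reaction list, the following minimal Python sketch expands one hypothetical rule ("any unphosphorylated site on a protein can be phosphorylated") into the explicit reactions that it implies. The function name and the tuple-of-flags encoding of a protein's state are assumptions made for illustration; they are not part of any existing format.

```python
from itertools import product


def expand_phosphorylation_rule(n_sites: int):
    """Enumerate the explicit reactions implied by a single rule.

    Each protein state is a tuple of 0/1 flags, one per site
    (0 = unphosphorylated, 1 = phosphorylated). The rule fires on any
    site whose flag is 0 and sets that flag to 1.
    """
    reactions = []
    for state in product((0, 1), repeat=n_sites):
        for i, flag in enumerate(state):
            if flag == 0:
                new_state = state[:i] + (1,) + state[i + 1:]
                reactions.append((state, new_state))
    return reactions


n_reactions_3 = len(expand_phosphorylation_rule(3))
n_reactions_10 = len(expand_phosphorylation_rule(10))
```

A protein with n independent sites has 2^n states, and the single rule expands to n * 2^(n-1) explicit reactions: 12 reactions for 3 sites and 5120 reactions for 10 sites. The rule itself stays the same size regardless of n, which is the succinctness that a rule-based format would exploit.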

1.8.5. Scalable network-free, multi-algorithmic simulator

To simultaneously represent well-characterized pathways with fine detail and coarsely represent poorly-characterized pathways, WC modeling requires a multi-algorithmic simulator that can scalably co-simulate mathematically-dissimilar submodels that are described using rule patterns. However, no existing simulator supports network-free, multi-algorithmic, and parallel simulation. To scalably simulate WC models, we should develop a parallel, network-free, multi-algorithmic simulator [goldberg2016toward]. At a minimum, the simulator should support FBA, ODE integration, and stochastic simulation.

1.8.6. Scalable tools for calibrating models

As discussed in Section 1.4.2.9, several tools are available for calibrating small single-algorithm models. However, these tools are not well-suited to calibrating large multi-algorithmic models. To calibrate WC models, we must develop new methods and software tools for scalably calibrating rule-based multi-algorithmic models. We and others have begun to explore using reduced models to efficiently calibrate WC models [karr2015summary]. However, further work is needed to formalize these methods, including developing automated methods for reducing WC models.

1.8.7. Scalable tools for verifying models

To fulfill our vision of using WC models to drive medicine and bioengineering, it will be critical for modelers to rigorously verify that WC models function as intended. As discussed in Section 1.4.2.10, researchers are beginning to adapt tools from computer science and software engineering to verify cell models. However, none of the existing or planned tools support rule-based, multi-algorithmic models. To help modelers verify WC models, we must adapt formal verification and/or unit testing for WC modeling. Furthermore, to help researchers quickly verify models, these tools should help researchers verify entire WC models, as well as help researchers verify reduced models and individual submodels.
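
As one hedged example of how unit testing could be adapted to cell models, the following Python sketch checks that each reaction in a model is element-balanced. The dictionary-based reaction encoding (negative coefficients for reactants, positive for products) and the function name are assumptions made for illustration, not the interface of any existing verification tool.

```python
from collections import Counter

# Elemental composition of each species in the toy model.
FORMULAS = {
    "glucose": {"C": 6, "H": 12, "O": 6},
    "O2": {"O": 2},
    "CO2": {"C": 1, "O": 2},
    "H2O": {"H": 2, "O": 1},
}


def element_imbalance(reaction: dict) -> dict:
    """Return the net element counts (products minus reactants).

    An empty dictionary means that the reaction is element-balanced.
    """
    net = Counter()
    for species, coefficient in reaction.items():
        for element, count in FORMULAS[species].items():
            net[element] += coefficient * count
    return {element: n for element, n in net.items() if n != 0}


# Aerobic respiration: glucose + 6 O2 -> 6 CO2 + 6 H2O
respiration = {"glucose": -1, "O2": -6, "CO2": 6, "H2O": 6}

# The same reaction with a deliberate stoichiometry error.
broken = {"glucose": -1, "O2": -6, "CO2": 6, "H2O": 5}

assert element_imbalance(respiration) == {}
assert element_imbalance(broken) == {"H": -2, "O": -1}
```

Checks of this kind could be run against individual submodels as well as the merged model, so that an error introduced by one contributor is caught before it propagates.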

1.8.8. Additional tools that would help accelerate WC modeling

In addition to these essential tools, we believe that WC modeling would also be accelerated by additional tools for annotating and imputing data, additional tools for sharing WC models and simulation results, additional tools for visualizing simulation results, and community standards for designing, annotating, and verifying WC models.

  • Tools and standards for annotating data. To make our experimental data more useful for modeling, we should develop software tools that help researchers annotate their data and encourage experimentalists to use these tools to annotate their data.
  • Bioinformatics prediction tools. While existing bioinformatics tools can predict many properties of metabolites, DNA, RNA, and proteins, additional tools are needed to accurately predict the molecular effects of insertions, deletions, and structural variants. Such tools would help researchers use WC models to design microbial genomes and predict the phenotypes of individual patients.
  • Repositories for WC models. To help researchers share whole-cell models, BioModels and other model repositories should be extended to support WC models. In addition, these repositories should be extended to support provenance metadata, validation metadata, simulation experiments, and simulation results.
  • Version control system for WC models. To help researchers collaboratively develop WC models, we should develop a version control system for tracking the changes to WC models contributed by individual collaborators and merging WC model components developed by collaborators. This system could be developed by combining Git [git2017] with a custom program for differencing WC models.
  • Simulation format. SED-ML and SESSL can represent simulations of models that are encoded in XML-based formats such as SBML and Java-based formats such as ML-Rules. However, neither is well-suited to representing simulations of models that are encoded in other formats such as BioNetGen. To accelerate WC modeling, we should extend SED-ML to support non-XML-based models or extend SESSL to support other programming languages such as Python and C++.
  • Database for organizing simulation results. We and others have begun to develop tools for organizing simulation results. However, these tools have limited functionality. To help researchers analyze WC simulation results, we must develop an improved database for simulation results that helps researchers quickly search simulation results for specific features and quickly retrieve specific slices of large simulation results datasets. This database should be implemented using a distributed database and/or data processing technologies such as Apache Spark.
  • Tools for visualizing simulation results. We and others have also begun to develop tools for visualizing high-dimensional simulation results. However, these tools have limited functionality, they are not easily extensible, and they struggle to handle large datasets. To help researchers analyze WC models to gain new biological insights, we must develop a new tool for visually exploring and analyzing WC simulation results. To enable researchers to incorporate new visual layouts, this tool should support a standard visualization grammar such as Vega [satyanarayan2017vega]. Furthermore, to handle terabyte-scale simulation result datasets, this tool should be implemented using a high-performance visualization toolkit such as VTK [vtk2017].
  • Community standards. To facilitate collaboration, we should develop guidelines for designing WC models, standards for annotating and verifying WC models, and a protocol for merging WC model components. The model design guidelines should describe the preferred granularity of WC model components and the preferred interfaces among WC model components. The standards for annotating and verifying WC models should describe the minimum acceptable semantic and provenance metadata for WC models. The protocol for merging WC model components should describe how to incorporate a new component into a WC model, how to test the new component and the merged model, and how to either accept the new component or reject the candidate component if it cannot be verified or is not properly annotated.
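
The version control item above proposes combining Git with a custom program for differencing WC models. The following minimal Python sketch shows what a structural difference between two model versions could look like, under the assumption that a model can be loaded as a dictionary keyed by component identifier. The function name, the dictionary layout, and the example components are hypothetical.

```python
def diff_models(old: dict, new: dict) -> dict:
    """Compare two model versions keyed by component identifier.

    Reports which components were added, removed, or changed, independent
    of the order in which components appear in a file.
    """
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}


version_1 = {
    "rxn_transcription": {"rate_constant": 0.05, "units": "1/s"},
    "rxn_degradation": {"rate_constant": 0.20, "units": "1/s"},
}
version_2 = {
    "rxn_transcription": {"rate_constant": 0.07, "units": "1/s"},
    "rxn_degradation": {"rate_constant": 0.20, "units": "1/s"},
    "rxn_translation": {"rate_constant": 1.50, "units": "1/s"},
}

changes = diff_models(version_1, version_2)
```

Unlike a line-based text difference, a comparison keyed by component identifier is not affected by reordering or reformatting of the model file, which is one reason a model-aware differencing program could complement Git.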

1.9. A plan for achieving comprehensive WC models as a community

In the previous sections, we described the potential of WC models to advance medicine and bioengineering, summarized the major bottlenecks to WC modeling, and outlined several technological solutions to these bottlenecks. To maximize our efforts to achieve WC models, we believe that we should begin to develop a plan for achieving WC models. Here, we propose a three-phase plan to achieve the first comprehensive WC model (Figure 1.9). The plan focuses on developing a WC model of H1-hESCs because we believe that the community should initially focus on a single cell line and because H1-hESCs are relatively easy to culture, well-characterized, karyotypically and phenotypically “normal”, genomically stable, and relevant to a wide range of basic science, medicine, and bioengineering. Although the plan focuses on a single cell line, the methods and tools developed under the plan would be applicable to any organism, and the H1-hESC model could be contextualized to represent other cell lines, cell types, and individuals.

Figure 1.9 The first WC models can be achieved in three phases: (1) demonstrating the feasibility of WC models by developing scalable modeling tools and using them to model several core processes, (2) demonstrating the feasibility of collaborative modeling by developing a collaborative modeling platform and using it to model additional processes, and (3) developing a comprehensive model as a community.

1.9.1. Phase I: Piloting the core technologies and concepts of WC modeling

Phase I should demonstrate the feasibility of WC models by developing the core technologies needed for WC modeling, and using these tools to build a model of a few critical pathways of H1-hESCs. First, we should develop tools for aggregating the data needed for WC modeling, tools for designing models directly from data, a rule-based format for describing models, tools for quickly simulating multi-algorithmic models, tools for efficiently calibrating and validating high-dimensional models, and tools for visualizing and analyzing high-dimensional simulation results. Second, a small group of researchers should use these tools and public data to build a model of the core pathways of H1-hESCs including several key signal transduction pathways, metabolism, DNA replication, transcription, translation, and RNA and protein degradation. Phase I should also begin to form a WC modeling community by organizing meetings and courses, developing WC modeling training materials, and discussing potential WC modeling standards.

1.9.2. Phase II: Piloting collaborative WC modeling

Phase II should focus on demonstrating the feasibility of collaborative WC modeling by developing collaborative modeling tools, and using them to expand the H1-hESC model begun in Phase I. First, we should combine the technologies developed in Phase I into a collaborative web-based WC modeling platform to enable multiple experts to build models together. Second, the community should develop standards for describing, validating, and merging submodels. Third, a modest consortium of modelers and experimentalists should expand the H1-hESC model developed in Phase I by partitioning H1-hESCs into distinct pathways, outlining the interfaces among these pathways, and tasking individual researchers with modeling additional pathways such as cell cycle regulation, DNA repair, and cell division. Fourth, we should extensively validate the combined model. Phase II should also continue to develop the fundamental technologies needed for WC modeling and continue to build a WC community by organizing meetings, courses, and other community events.

1.9.3. Phase III: Community modeling and model validation

Phase III should produce the first comprehensive WC model. First, we should assemble a large community of modelers and experimentalists and train them to use the platform developed in Phases I and II. Second, individual researchers should volunteer to model individual pathways and merge them into the global H1-hESC model. Third, we should continue to validate the combined model. Fourth, researchers should use the model to generate testable hypotheses to discover new biology, new disease mechanisms, and new drug targets. Fifth, we should also begin to develop methods for contextualizing the H1-hESC model to represent other cell lines, cell types, and individuals. In addition, the community should continue to develop the core technologies and standards needed for WC modeling, continue to refine the partitioning of cells into pathways, continue to refine the interfaces among the pathways, continue to organize meetings and courses, and continue to develop WC modeling tutorials.

1.10. Ongoing efforts to advance WC modeling

In the previous section, we proposed a plan for achieving the first comprehensive WC model as a community. Although we do not yet have an organized WC modeling community, we and others are beginning to pilot WC models and the technology needed to achieve them. Here, we summarize the ongoing efforts to pioneer WC modeling.

1.10.1. Genomically-centric models

Currently, three genomically-centric WC models are in development: models of Mycoplasma pneumoniae, E. coli, and H1-hESCs.

1.10.1.1. Mycoplasma pneumoniae

To explore how to build more comprehensive and more accurate models, we are working with Drs. Maria Lluch-Senar and Luis Serrano to develop a comprehensive model that represents all of the characterized genes of the bacterium M. pneumoniae.

M. pneumoniae is a small gram-positive bacterium that has one of the smallest genomes among all known free-living organisms and that is one of the most common causes of walking pneumonia. M. pneumoniae is tractable to WC modeling because it has a small genome and a small mass; because Dr. Lluch-Senar, Dr. Serrano, and others have extensively characterized M. pneumoniae; and because most of its genome is functionally annotated. However, M. pneumoniae can be difficult to characterize because it grows slowly and because there are few experimental methods for manipulating it; some aspects of M. pneumoniae are challenging to model because there is no known defined growth medium for it; and the M. pneumoniae research community is small. Because M. pneumoniae has such a small genome, M. pneumoniae is frequently used to study the minimal requirements of cellular life, explore the origins of cellular life, and pilot genome-scale synthetic biology methods such as whole-genome synthesis and genome transplantation. M. pneumoniae is also frequently studied to gain insights into the pathophysiology of walking pneumonia.

The model will be based on both genomic, transcriptomic, and proteomic data about M. pneumoniae collected by Drs. Lluch-Senar and Serrano and a broad range of biochemical and single-cell data about related species aggregated from public databases and publications. In addition to using the model to demonstrate the feasibility of more comprehensive models and drive the development of WC modeling methods, we hope to use this model to engineer a fast-growing, efficient chassis for future bioengineering projects.

1.10.1.2. Escherichia coli

To explore how to model more complex bacteria, Prof. Markus Covert and his group at Stanford University are modeling the model gram-negative bacterium E. coli. The project focuses on E. coli because E. coli is the best-characterized bacterium and because there are a wide variety of experimental methods for manipulating and characterizing E. coli. Because E. coli is substantially more complex than reduced bacteria such as M. genitalium and M. pneumoniae, this project will initially focus on modeling core pathways such as metabolism, RNA and protein synthesis and degradation, DNA replication, and cell division. The model will be based primarily on data observed for E. coli aggregated from a wide range of sources. Prof. Covert and his group are using this model to demonstrate the feasibility of more comprehensive WC models, as well as gain novel insights into the pathogenesis of E. coli.

1.10.1.3. H1 human embryonic stem cells (hESCs)

To explore how to model eukaryotic cells, we are also beginning to model H1-hESCs. ESCs are pluripotent cells derived from the inner cell mass of a blastocyst at 4-5 days post-fertilization that can generate all three primary germ layers. We have chosen to pilot human WC models with hESCs because they are karyotypically and phenotypically “normal”; they are genomically stable; they can self-renew; and they are relevant to a wide range of basic science, medicine, and tissue engineering.

Furthermore, we have chosen to focus on H1-hESCs because they can be cultured with feeder-free media and because they have been extensively characterized. For example, H1 was one of the three cell lines most deeply characterized by the ENCODE project [encode2012integrated]. In addition, H1 was one of the first five hESC lines [thomson1998embryonic], H1 was the first cell line to be approved under NIH’s Guidelines for Stem Cell Research, and, as of 2010, H1 was studied in 30% of all hESC studies [loser2010human].

Because human cells are vastly more complex than bacteria, we are beginning by modeling the core pathways responsible for stem cell growth, maintenance, and self-renewal, including metabolism, transcription, translation, RNA and protein degradation, signal transduction, and cell cycle regulation. This model will also be based on both genomic, transcriptomic, and proteomic data about H1-hESCs aggregated from publications and biochemical and single-cell data about related cell lines aggregated from several databases. In addition to using the model to demonstrate the feasibility of human WC models and drive the development of WC modeling methods, we hope to use the model to gain new insights into the biochemical mechanisms responsible for regulating the rate of stem cell growth.

1.10.2. Physiologically-centric, spatially-centric, and hybrid models

As described in Section 1.6.3-1.6.4, Klipp, Roberts, and others are also developing physiologically-centric models of S. cerevisiae, spatially-centric models of E. coli, and hybrid spatially-centric/FBA models of E. coli.

1.10.3. Technology development

Currently, we are developing three technologies: tools for aggregating the data needed for WC modeling; a rule-based format for concisely representing multi-algorithmic WC models; and a simulator for rule-based, multi-algorithmic models.

1.10.3.1. Data aggregation

WC modeling requires a wide range of data. Unfortunately, as described in Section 1.7.1, aggregating this data is a major bottleneck to WC modeling because this data is scattered across a wide range of databases and publications. To help modelers obtain the data needed for WC modeling, we are developing a methodology for systematically and scalably identifying, aggregating, standardizing, and integrating the data needed for WC modeling, and we are developing a software program called Datanator which implements this methodology. The methodology consists of eight steps:

  1. Aggregation. Modelers should retrieve a wide range of data from a wide range of sources such as metabolite concentrations from ECMDB, RNA concentrations from ArrayExpress, protein concentrations from PaxDb, reaction stoichiometries from KEGG, and kinetic parameters from SABIO-RK. Where possible, this should be implemented using downloads and web services. Where this is not possible, this should be implemented by scraping web pages and manually curating individual publications. Importantly, modelers should also record the provenance of each downloaded dataset.
  2. Parsing. Modelers should parse each data source into an easily manipulatable data structure.
  3. Standardization. Modelers should standardize the identifiers, metadata, and units of their data. The metadata should include the species and environmental conditions that were observed, the method used to measure the data, the investigators who collected the data, and the citation of the original data. We recommend using absolute identifiers such as InChI to describe all possible measurements, using ontologies such as the Measurement Method Ontology (MMO) to describe metadata consistently, and using SI units.
  4. Integration. Modelers should merge the aggregated data into a single dataset. We recommend that modelers use relational databases such as SQLite to organize their data and make their data searchable.
  5. Filtering. For each model parameter that modelers would like to constrain with experimental data, modelers should identify the most relevant observations within their dataset by scoring the similarity between the physical properties of the parameter and each observation, the species that they want to model and the observed species, and the environmental condition that they want to model and the observed conditions.
  6. Reduction. For each model parameter, modelers should reduce the relevant data to constraints on the value of the parameter by calculating the mean and standard deviation of the relevant data, weighted by its similarity to the physical property, species, and environmental condition that the modeler wants to model.
  7. Review. Because it is difficult to fully describe the context of experimental measurements and, therefore, difficult to automatically identify relevant data for a model, modelers should manually review the least relevant data to potentially select alternative observations or integrate more relevant data from other sources.
  8. Storage. Lastly, modelers should store the reduced data and its provenance in a data structure that is conducive to building models. We recommend organizing this data using a specialized PGDB such as WholeCellKB.
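
The filtering and reduction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Datanator implementation: the similarity score (a product of per-attribute factors, with an arbitrary penalty of 0.5 for each mismatch) and the dictionary layout of observations are hypothetical, and the standard deviation is the simple weighted population form.

```python
import math


def similarity(observation: dict, target: dict) -> float:
    """Hypothetical relevance score in (0, 1]: 1.0 for a perfect match,
    halved for each attribute that differs from the modeling target."""
    score = 1.0
    for attribute in ("species", "medium"):
        if observation[attribute] != target[attribute]:
            score *= 0.5
    return score


def reduce_observations(observations: list, target: dict):
    """Reduce observations to a similarity-weighted mean and standard deviation."""
    weights = [similarity(o, target) for o in observations]
    values = [o["value"] for o in observations]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one observation must have positive weight")
    mean = sum(w * v for w, v in zip(weights, values)) / total
    variance = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return mean, math.sqrt(variance)


target = {"species": "M. pneumoniae", "medium": "rich"}
observations = [
    {"species": "M. pneumoniae", "medium": "rich", "value": 10.0},  # weight 1.0
    {"species": "E. coli", "medium": "minimal", "value": 20.0},     # weight 0.25
]
mean, std = reduce_observations(observations, target)
```

With these inputs, the closely matching observation dominates: the weighted mean is 12.0 rather than the unweighted 15.0, and the weighted standard deviation is 4.0. The resulting mean and standard deviation could then serve as a constraint on the corresponding model parameter.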

We have already developed a common platform which implements this methodology, and data aggregation modules for the most critical data types for WC modeling. Going forward, we plan to develop additional modules for aggregating data from a wider range of sources and we plan to develop a user-friendly web-based interface for using Datanator. In addition, we hope to explore additional data aggregation methods such as natural language processing and crowdsourcing.

1.10.3.2. Model representation

As described in Section 1.7.4, no existing format is well-suited to representing composite, multi-algorithmic WC models. In particular, there is no format which is well-suited to describing all of the combinatorial complexity of cellular biochemistry, representing composite, multi-algorithmic models, and representing the semantic biological meaning and provenance of models.

To accelerate WC modeling, we are developing wc_rules, a more abstract rule-based format for describing WC models. The format will be able to represent each molecular species at multiple levels of granularity (for example, as a single species, as a set of sites, and as a sequence); represent all of the combinatorial complexity of each molecular species and interaction; represent composite, multi-algorithmic models; represent the data, assumptions, and design decisions used to build models; and represent the semantic biological meaning of models. We are developing tools to export models described with wc_rules to BioNetGen and SBML, as well as a simulator for simulating models described with wc_rules.

1.10.3.3. Simulation of genomically-centric models

As described in Section 1.7.3, no existing simulator is well-suited to simulating computationally-expensive, high-dimensional, rule-based, multi-algorithmic WC models. In particular, there are only a few parallel simulators, only a few rule-based simulators, only a couple of multi-algorithmic simulators, and no simulator which supports all of these technologies.

To accelerate WC modeling, we are beginning to use the Viatra [varro2016road] graph transformation engine and the ROSS [carothers2002ross] PDES engine to develop wc_sim, a parallel, network-free, multi-algorithmic simulator that can simulate models described in wc_rules [goldberg2016toward]. Simulations will consist of six steps:

  1. Compile models to a low-level format. We will compile models described with wc_rules to a low-level format which can be interpreted by the simulation engine.
  2. Merge mathematically compatible submodels. We will analytically merge all mathematically-compatible submodels, producing a model which is composed of at most one FBA, one ODE, and one SSA submodel.
  3. Partition submodels into cliques. To use multiple machines to simulate models, we will partition models into cliques that can be simulated on separate machines with minimal communication to synchronize the cliques.
  4. Assign cliques to machines. We will use ROSS to assign each clique to a separate machine and use event messages and rollback to synchronize their states.
  5. Co-simulate mathematically-distinct submodels. We will co-simulate the FBA, ODE, and SSA submodels by periodically calculating the fluxes predicted by FBA and ODE models and interpolating them with each SSA event.
  6. Rule-based simulation of SSA cliques. We will use Viatra to represent each species and reaction pattern as a graph and iteratively select reactions, fire reactions, and update the species graphs. To efficiently simulate both sparsely and densely concentrated species, we will use a hybrid population/particle representation in which each species graph will represent a species and its copy number, and we will periodically merge identical graphs that represent the same species.
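
To make the idea of co-simulating mathematically-distinct submodels more concrete, the following Python sketch couples a stochastic (SSA) submodel with an ODE submodel through shared state. It is a deliberately simple, single-process illustration that synchronizes the two submodels at fixed intervals; it is not the algorithm used by wc_sim, and the species, rate constants, and function name are all hypothetical.

```python
import random


def co_simulate(t_end=10.0, dt=0.1, seed=0):
    """Co-simulate an SSA submodel (mRNA count) and an ODE submodel (NTP pool).

    Within each interval, the SSA submodel fires events using propensities
    computed from the current continuous state. At the end of each interval,
    the ODE submodel takes one explicit Euler step using the current count.
    """
    rng = random.Random(seed)
    t, mrna, ntp = 0.0, 0, 100.0
    k_syn, k_deg = 0.05, 0.2  # hypothetical SSA rate constants
    k_in, k_use = 5.0, 0.02   # hypothetical ODE rate constants
    history = [(t, mrna, ntp)]
    while t < t_end:
        t_start = t
        t_sync = min(t + dt, t_end)
        # SSA submodel: Gillespie direct method until the next synchronization.
        while True:
            a_syn = k_syn * ntp   # synthesis propensity depends on the ODE state
            a_deg = k_deg * mrna  # degradation propensity
            a_total = a_syn + a_deg
            if a_total <= 0:
                break
            tau = rng.expovariate(a_total)
            if t + tau > t_sync:
                break  # next event falls after the synchronization time
            t += tau
            if rng.random() * a_total < a_syn:
                mrna += 1
            else:
                mrna -= 1
        # ODE submodel: d(ntp)/dt = k_in - k_use * mrna * ntp (explicit Euler),
        # clamped at zero to keep the pool non-negative.
        ntp = max(ntp + (t_sync - t_start) * (k_in - k_use * mrna * ntp), 0.0)
        t = t_sync
        history.append((t, mrna, ntp))
    return history


trajectory = co_simulate(seed=1)
```

The synchronization interval dt controls the trade-off between accuracy and cost: the SSA propensities treat the continuous pool as constant within each interval, so a smaller dt tracks the coupling more closely at the expense of more ODE steps. A scalable simulator would additionally need the partitioning, parallel synchronization, and rule-based pattern matching described in the steps above.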

1.11. Resources for learning about WC modeling

To learn more about WC modeling, we recommend attending a WC modeling summer school or participating in the WC modeling forum. Below are brief descriptions of these resources.

1.11.1. Summer schools

We and others organize annual WC modeling summer schools [waltemath2016toward][karr20162016][karr20172017] for graduate students and postdoctoral scholars. The schools teach the fundamental principles of WC modeling through brief lectures and hands-on exercises. The schools also provide opportunities to network with other WC researchers. Please see http://wholecell.org for information about upcoming schools.

1.11.2. Online forum

The WC modeling forum is an online platform which enables researchers to initiate and participate in discussions about WC modeling.

1.12. Outlook

Despite several challenges, we believe that WC models are rapidly becoming feasible thanks to ongoing advances in experimental and computational technology. In particular, in Section 1.9, we have proposed a three-phase plan to achieve comprehensive WC models as a community. The cornerstones of this plan include developing practical solutions to the key bottlenecks; forming a collaborative interdisciplinary community; and adhering to common interfaces, formats, identifiers, and protocols. We have already developed tools for organizing the data needed for WC modeling, organizing WC simulation results, and visualizing WC simulation results, and we have begun to organize a WC modeling community. Currently, we are developing tools for aggregating the data needed for WC modeling, concisely describing WC models, and scalably simulating WC models, and we are continuing to organize WC modeling meetings. We are eager to advance WC modeling, and hope you will join us!