Design Concepts - Data Model

The ability to conduct effective analysis across a Synapse hypergraph is highly dependent on the design and implementation of an appropriate data model for a given knowledge domain. The specifics of any data model (types, forms, properties) will vary based on both the knowledge domain and specific analytical needs.

A full discussion of the considerations (and potential complexities) of creating a well-designed data model is beyond the scope of this documentation. However, there are a few general principles and recommendations that should be kept in mind when developing a data model:

  • The model is an abstraction. Analysis often involves subtle distinctions and qualifications; this is why analysis is often provided in long-form reporting, where natural language can be used to convey variations in confidence or provide caveats surrounding conclusions. Capturing data and corresponding analysis in a formalized data model trades some of these subtleties for consistent representation and programmatic accessibility. A data model can never fully capture the richness and detail of a long-form report. But a well-designed data model can capture critical components in a way that conveys sufficient depth of analytical findings so that an analyst only rarely needs to refer to additional long-form reporting or original sourcing for clarification.

  • The model should be self-evident. While the model is necessarily an abstraction, it should not be abstracted to the point where the data and analysis in the model cannot stand on their own. That is, while there are times when supplemental external reports or explanatory notes may be helpful, they should not be required in order to understand the information in a Cortex. The model should be designed to convey the maximum amount of information possible: entities, relationships, and analytical annotations should be unambiguous, well-defined, and clearly understood. An analyst with domain knowledge but no prior exposure to analytical findings should be able to look at the information represented in a Cortex and understand the analytical line of thought.

  • The model should be driven by real-world analytical need and analytical relevance. Any model, regardless of knowledge domain, should be designed around the analytical questions that need to be answered by that model. Many models are designed as academic abstractions (“how would we classify all possible exploitable vulnerabilities in software?”) without consideration for the practical questions that the data is intended to answer. Are some exploits theoretically possible, but never yet observed in the real world? Are some distinctions too fine-grained (or not fine-grained enough) for your analytical needs? Domain knowledge experts should have significant input into the type of data modeled, what analysis needs to be performed on the data, and how it should be represented.

    The best models evolve in a cycle of forethought combined with real-world stress-testing. Creating a model “on the fly” with no prior consideration often leads to a narrowly focused, fragmented data model – in the immediacy of detailed analysis, analysts (or developers) may focus on the trees while missing the big picture of the forest. However, even the most well-thought-out model planned in the abstract will still fall short when faced with the vagaries and inconsistencies of real-world data. Experience has shown that there are always edge cases, circumstances, or anomalies that cannot be anticipated. The most effective models are typically those that are planned carefully up front, then tested against real-world data and refined before being placed fully into production.

  • Test the basics and build from there. No data model is set in stone – in fact, a good model will both expand and evolve with analytical need. That said, changes to the model may require revising or updating existing model elements and associated analysis, and some changes are easier to effect than others. When introducing a new form to the model, consider carefully what the “essence” of that object is - what makes it unique, and therefore how it should “look” within the model - and design a form to capture that. It’s fine (and in some ways preferable) to start with a limited or “stub” form while you test it against real-world data. It is relatively easy to make additive changes to the data model (introduce new forms or new secondary properties). It is more challenging to modify the model once you have used a form to capture and encode data into nodes, because those changes may require migrating existing data. For this reason it may be preferable to start with more basic elements and build on them once your design can be fully stress-tested, as sketched in the example below.
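
    As an illustration of this stub-first, additive approach, the sketch below assumes Storm’s $lib.model.ext interface for extended model elements; the form and property names (_demo:widget, _vendor) are purely hypothetical and would be replaced by elements appropriate to your own knowledge domain.

      // Start with a minimal stub form that captures the essence of the new object.
      $forminfo = ({"doc": "A stub form representing a widget of analytical interest."})
      $lib.model.ext.addForm(_demo:widget, str, ({}), $forminfo)

      // Additive change: once real-world testing shows the need, extend the stub
      // with a new secondary property without disturbing existing nodes.
      $propinfo = ({"doc": "The vendor that produces the widget."})
      $lib.model.ext.addFormProp(_demo:widget, _vendor, (str, ({})), $propinfo)

    Because both calls only add new model elements, they can be applied to a Cortex that already contains data. Removing or redefining _demo:widget later is harder, since any nodes already created from the form would first need to be migrated or deleted.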