SREnity is designed to be as simple and intuitive as possible. It is highly-opinionated in its recommendations - favoring convention over configuration. However, the tool is also architected to allow extreme flexibility and control for power users and large corporations as it evolves.
It takes about five to ten minutes to get up and running with meaningful insights and data. All that is required is signing up for an account and integrating the major components of your environment with SREnity via API keys, cluster agents, etc.
Getting started begins with signing up for an organizational account at the SREnity Dashboard. While every account starts with an unrestricted free trial, a credit card is required to sign up. You will not be charged until the end of the trial period.
Once you have an account, the next step is to connect your tooling via plugins/integrations. Log in and go to the manage plugins page to start adding plugins.
Every tool has its own methodology for creating credentials and integrating with SREnity. For most plugin types, we only need a read-only API key. Specific instructions for each plugin type will appear in the right-hand navigation bar on the manage plugins page as you add them.
After you’ve added plugins, SREnity will automatically run the first scan on your new integration.
While the scan is taking place, a notification will appear at the top of your dashboard letting you know that your data is in the process of being updated.
Once the scan is finished, you can refresh your dashboard to see your updated notifications and rating.
The gathered data is crunched into a variety of notifications and recommendations relating to the cost, security, and stability/up-time (delivery) of your environments and development processes.
You can review these notifications on your dashboard and deal with them accordingly.
Notifications are ranked according to weighting and perceived severity.
You can click into each notification to get further information - including more detailed descriptions, affected resources, and background reading.
As the underlying issues are addressed and changed, the relevant notifications will disappear from your dashboard.
Take a look at the Usage section of our docs for more information.
This section will help you understand what a “test” is in SREnity, how it is structured, and generally how to write them yourself.
What is a test?
There are two types of testing going on within the SREnity system - intelligent testing and templatized testing.
Intelligent testing consists of the machine learning and statistical/artificial intelligence tests running in the background, watching for trends/anomalies. You don’t create these yourself directly.
Templatized tests are analogous to a “desired state” for something in our infrastructure: we compare the desired state of the infrastructure to the actual state and infer something worth tracking. Fundamentally, a SREnity test is composed of two pieces - a list of data to run a comparison against, and the actual comparison we want done.
What are we testing?
The vast majority of the data which flows through the SREnity system is “state” data - this data will tell us about the current state/configuration/status of the infrastructure. Within SREnity, this data is all encoded into JSON. Because of this, we can use a JSON query language (in this case a DSL of ObjectPath) to track state changes and compare to known/desired configurations.
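To make this concrete, here is a minimal sketch of what a JSON-encoded "state" snapshot might look like once loaded in Python. The exact shape and field names are assumptions for illustration, borrowing the `instances`, `instance_id`, `status`, and `memory` fields used in the examples in this section:

```python
import json

# Illustrative snapshot of "state" data as SREnity might encode it.
# The structure here is an assumption, not SREnity's actual schema.
state = json.loads("""
{
  "instances": [
    {"instance_id": "i-123456789", "status": "OK", "memory": 16384},
    {"instance_id": "i-987654321", "status": "DEGRADED", "memory": 8192}
  ]
}
""")

# A query language over this JSON lets us compare the actual state
# (what the snapshot says) against a desired state (what a test asserts).
print(state["instances"][0]["status"])
```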
So how do we write our own test?
A test consists of the notification which is the short description that appears on the dashboard when an alert is created, as well as a description which appears as the “further reading” in the notification details.
For the test itself, we use a DSL (domain specific language) of the query language called ObjectPath.
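Putting those pieces together, a test definition can be thought of as three fields - the notification, the description, and the query itself. The field names below are illustrative assumptions, not SREnity's actual schema:

```python
# Illustrative test definition; the field names are assumptions,
# not SREnity's actual schema.
test = {
    # Short description shown on the dashboard when an alert is created.
    "notification": "Instances with less than 9000 MB of memory",
    # Longer "further reading" text shown in the notification details.
    "description": "Undersized instances may cause stability issues...",
    # The SREnity ObjectPath query that performs the comparison.
    "query": '$.instances[*][@.memory > 9000][@.name]',
}

print(sorted(test.keys()))
```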
The only difference between how we use ObjectPath and the default language arises from the fact that our tests must return a "true" or "false" result. ObjectPath by default returns objects, so we have extended the default ObjectPath syntax with two extra sets of selectors/operators: the first applies our test conditionals, and the second provides the unique identifier of each tested item so we can report back on it.
Standard ObjectPath Test:
$.instances[@.memory > 9000]=> Returns a list of instances whose memory is over 9000.
SREnity ObjectPath Test:
$.instances[*][@.memory > 9000][@.name]=> Scores what percentage of instances have memory over 9000, and returns a list of instance names for those whose memory is not over 9000.
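The scoring behavior of the SREnity form can be emulated in plain Python. This is a sketch of the semantics only, not SREnity's actual implementation; the sample data and field names are made up:

```python
# Hypothetical state data, as it might look after encoding to JSON.
instances = [
    {"name": "web-1", "memory": 16384},
    {"name": "web-2", "memory": 8192},
    {"name": "db-1",  "memory": 4096},
]

def run_test(items, conditional, key):
    """Emulate $.instances[*][<conditional>][<key>]: score the pass
    percentage and collect the key field of every item that fails."""
    failures = [item[key] for item in items if not conditional(item)]
    score = (len(items) - len(failures)) / len(items)
    return score, failures

# Equivalent of $.instances[*][@.memory > 9000][@.name]
score, failures = run_test(instances, lambda i: i["memory"] > 9000, "name")
print(score)     # fraction of instances whose memory is over 9000
print(failures)  # names of the instances that failed the test
```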
We currently provide two methods of creating/editing tests: “Basic Mode” or “Advanced Mode”.
When creating/editing in “Advanced Mode”, you can provide the exact ObjectPath query you want run.
When creating/editing in “Basic Mode”, you are directed through the ObjectPath syntax to help create simple tests. We ask you to provide a base object indicator, a set of preconditions, the actual test conditionals, and an object key locator.
Base object indicator:
This is an ObjectPath selector pointing at the list of objects we’d like to test: (Note: $ references the root of the state data.)
$.instances[*] # Would apply our test to every instance.
Precondition:
This is an ObjectPath operator that can be used to limit the list provided by the base object indicator to only test a subset of the list: (Note: @ references the object being compared against.)
@.instance_id == "i-123456789" # Would only apply our test to that specific instance.
Test conditional:
This is an ObjectPath operator that contains the actual state we’d like to test for among our final subset of objects: (Note: @ references the object being compared against.)
@.status == "OK" # Would make sure that all of the instances we're testing have a "status" of "OK"
Object key locator:
This is an ObjectPath selector indicating the uniquely identifying field of the object we’re testing so the system can report back on error: (Note: @ references the object being compared against.)
@.name # Would report back each failing object's "name" field.
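As a sketch of how Basic Mode might assemble those four inputs into a full query string - the composition order here is an assumption, inferred from the SREnity example syntax shown earlier, not a documented guarantee:

```python
def build_query(base, precondition, conditional, key_locator):
    """Assemble a SREnity-style ObjectPath query from the four Basic
    Mode inputs. The bracket-composition order is assumed from the
    example syntax $.instances[*][@.memory > 9000][@.name]."""
    parts = [base]
    if precondition:
        parts.append(f"[{precondition}]")
    parts.append(f"[{conditional}]")
    parts.append(f"[{key_locator}]")
    return "".join(parts)

query = build_query(
    base="$.instances[*]",
    precondition='@.instance_id == "i-123456789"',
    conditional='@.status == "OK"',
    key_locator="@.name",
)
print(query)
```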
Tests are weighted from 0 to 1 across three categories (Security, Cost, Delivery) and three sub-categories (Architecture, Support, Maintenance) based on how critical the information they test for is. For example: a test for a severe security vulnerability will have a higher rating (closer to 1) than a test for servers that aren’t following proper naming conventions/policies.
Security: Tests that relate to the overall security of the system like public network exposure, secure system configuration, patching, passwords, IAM, etc.
- Architecture: Relates to security posture of the overall design and setup of the environment.
- Support: Relates to how the security posture affects short-term supportability of the environment.
- Maintenance: Relates to how the security posture impacts long-term maintenance costs.
Cost: Tests that relate to the overall cost-effectiveness of the system, like reserved instance usage, proper instance sizing, any sort of unused paid capacity, etc.
- Architecture: Relates to the relative cost-effectiveness of the overall design and setup of the environment.
- Support: Relates to the relative cost-effectiveness of the short-term supportability of the environment.
- Maintenance: Relates to the relative cost-effectiveness of long-term maintenance costs.
Delivery: Tests that relate to the overall delivery of the system/end product, like uptime, response times, SLAs, maximum concurrent sessions, etc.
- Architecture: Relates to the robustness/stability of the overall design and setup of the environment.
- Support: Relates to the robustness/stability of the environment and its impact on short-term supportability/management of the environment.
- Maintenance: Relates to the robustness/stability of the environment and its impact on long-term maintenance costs.
Say we have a test which ensures our MySQL security patches are up-to-date. Its weightings would lean heavily toward Security, with little or no weight in Cost and Delivery.
We generally try to avoid “double counting” on test weightings (i.e. one could argue that security patching also has an impact on delivery and/or cost, but it is primarily a security issue) unless the issue is so severe that it warrants doing so to raise it to the top of our list. Likewise, when the impact is spread evenly across sub-categories, we try to just split the weight evenly between them.
As a rule of thumb, a highly impactful test will have a total weighting of 2.5+. A test of average importance will be weighted in the 1.5 - 2.5 range. A test of low importance will range between 0.5 - 1.5 in weighting.
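That rule of thumb can be expressed directly as a small helper. This is only a sketch; the band cutoffs come straight from the ranges above, and the behavior below 0.5 is an assumption since the document does not describe it:

```python
def impact_band(total_weight):
    """Map a test's total weight (the sum of its 0-1 category weights)
    to the rough impact bands described in the rule of thumb."""
    if total_weight >= 2.5:
        return "high"
    if total_weight >= 1.5:
        return "average"
    if total_weight >= 0.5:
        return "low"
    # Below 0.5 is not described in the docs; treating it as negligible
    # is an assumption.
    return "negligible"

print(impact_band(2.8))  # high
print(impact_band(1.9))  # average
print(impact_band(0.8))  # low
```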