Commit graph

5 commits

Author SHA1 Message Date
Carlos Crespo
860f8dbf13
[Serverless][Observability] Use roles-based testing - api_integration (#184654)
part of: [#184033](https://github.com/elastic/kibana/issues/184033)

## Summary

This PR changes the observability serverless tests (and the affected
security tests) so they no longer run with operator privileges. A sketch of
the role-scoped pattern follows the test steps below.


### How to test

- Follow the steps from
https://github.com/elastic/kibana/blob/main/x-pack/test_serverless/README.md#run-tests-on-mki
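
A minimal sketch of the role-scoped pattern these tests move toward; the service and method names (`svlUserManager`, `createM2mApiKeyWithRoleScope`, etc.) are assumptions modeled on the serverless FTR utilities described in that README, not code taken from this PR:

```ts
// Hedged sketch of a role-scoped api_integration test; service and method
// names are assumptions, not verbatim from this PR.
import type { FtrProviderContext } from '../../ftr_provider_context';

export default function ({ getService }: FtrProviderContext) {
  const svlUserManager = getService('svlUserManager');
  const supertestWithoutAuth = getService('supertestWithoutAuth');

  describe('example suite without operator privileges', () => {
    let roleAuthc: { apiKeyHeader: Record<string, string> };

    before(async () => {
      // Mint credentials scoped to a predefined role instead of the operator user.
      roleAuthc = await svlUserManager.createM2mApiKeyWithRoleScope('admin');
    });

    after(async () => {
      await svlUserManager.invalidateM2mApiKeyWithRoleScope(roleAuthc);
    });

    it('calls the API as a role-scoped user', async () => {
      await supertestWithoutAuth
        .get('/api/status')
        .set(roleAuthc.apiKeyHeader) // role-scoped auth header, no operator
        .set('kbn-xsrf', 'true')
        .expect(200);
    });
  });
}
```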
2024-06-07 08:34:28 -07:00
Kevin Delemme
36008b09eb
chore(slo): Add tests for historical summary api (#183648) 2024-05-17 09:32:04 -04:00
Chris Cowan
96dc2a5104
[SLO] Add dependencies for Burn Rate rule suppression (#177078)
## 🍒  Summary

This PR adds a rule dependency feature to the SLO Burn Rate rule to
enable rule suppression when one of the dependencies meets the
suppression criteria.

### 📟 Use case

When you add a rule dependency to your SLO Burn Rate rule, you will also
choose which action groups you want to suppress on. For example, if you
have rule `A` which depends on rule `B` and you want to suppress the
actions of rule `A` when rule `B` is triggering `Critical` or `High`,
you'd add rule `B` and pick the action groups `Critical` and `High`.
When rule `B` is triggering either of those action groups, ALL of the
actions for rule `A` will be suppressed.

When an action is suppressed, we trigger a `Suppressed` action group and
set the context variable `{context.suppressedAction}` to the action group
that would have been triggered if the rule hadn't been suppressed. This
allows users to create an "action" for `Suppressed` alerts so they can
still send notifications without waking up the team for a `Critical` or
`High` severity alert.
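
For example, a notification action attached to the `Suppressed` group could route to a low-urgency channel with a message template along these lines (a hedged example: Kibana action messages use mustache-style double braces, and `sloName`/`reason` are assumed context variables):

```
[Suppressed] Burn rate alert for {{context.sloName}}.
Would have fired as: {{context.suppressedAction}}
Reason: {{context.reason}}
```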

If you have 2 rules that use a group-by, then the suppression happens on
the intersection of the `slo.instanceId` values. For example, imagine we
have an Nginx Proxy in front of a Node.js web service and we've created an
availability SLO based on `status_code < 500` for both, grouped by
`url.domain`. When the Node.js app responds with a `500`, the Nginx Proxy's
SLO will start to degrade because of the Node.js service. The admins for
the Nginx Proxy would like to receive alerts only if the Node.js web
service is "healthy", so they've listed the Node.js burn rate rule as a
dependency to suppress on `Critical` or `High` burn rates.

When one of the domains, `you-got.mail`, starts to throw 500s and the burn
rate becomes `High`, the rule will suppress the alert for the
`you-got.mail` Nginx Proxy instance. If another Nginx domain, `box.mail`,
started throwing `502`s because of a misconfiguration, the alert would
trigger normally, because the `box.mail` instance of the rule dependency
for the Node.js web service is still healthy (i.e., not triggering
`Critical` or `High`).

Suppression between group-by SLOs and non-group-by SLOs works like this
(see the sketch after this list):

- If an SLO with a group-by depends on a non-group-by SLO, all instances of
the group-by will be suppressed.
- If an SLO without a group-by depends on an SLO with a group-by, the
non-grouped SLO will be suppressed if ANY of the instances of the group-by
are triggering the "suppress on" action groups.
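
A minimal TypeScript sketch of that set logic, assuming a non-grouped SLO shows up as a single `'*'` instance; names are illustrative, not the implementation in this PR:

```ts
// Illustrative sketch of the suppression set logic above; not the shipped code.
type InstanceId = string;
const ALL_VALUE = '*'; // a non-grouped SLO has a single "all" instance

function suppressedInstances(
  primary: Set<InstanceId>, // instances the primary rule wants to alert on
  dependency: Set<InstanceId> // dependency instances in a "suppress on" group
): Set<InstanceId> {
  // Dependency has no group-by: suppress every primary instance.
  if (dependency.has(ALL_VALUE)) return new Set(primary);
  // Primary has no group-by: suppress if ANY dependency instance is triggering.
  if (primary.has(ALL_VALUE)) return dependency.size > 0 ? new Set(primary) : new Set();
  // Both grouped: suppress the intersection of the instance ids.
  return new Set([...primary].filter((id) => dependency.has(id)));
}
```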

### 💻 Screenshots

Adding a rule dependency for MongoDB to a Node.js web app


In this scenario, the Nginx Proxy routes to the Admin Console, which reads
data from MongoDB. The connection between MongoDB and the Admin Console has
a network outage, which causes the MongoDB rule to trigger a `Critical`
action group and suppresses the `Critical` action for the Admin Console.
The Admin Console also goes `Critical`, which in turn suppresses the rule
for the Nginx Proxy.


### ⚙️ How it works

- Execute the primary rule and evaluate whether it should trigger any actions
- If the primary rule is triggering, execute each of the dependencies
(in the same process, using the same function) and suppress as follows
(sketched below):
  - For group-by SLOs that depend on another group-by SLO, we suppress the
intersection of the instanceIds.
  - For group-by SLOs that depend on a non-group-by SLO, we suppress all
the instanceIds.
  - For a non-group-by SLO that depends on a group-by SLO, we suppress if
ANY instanceId matches (not recommended).
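
To make the execution order concrete, here is a rough sketch of that flow; the types and names are assumptions, not the actual rule executor:

```ts
// Rough sketch of the executor flow above; names are assumptions, not Kibana's API.
interface Dependency {
  ruleId: string;
  suppressOn: string[]; // action groups to suppress on, e.g. ['critical', 'high']
}

interface Evaluation {
  instanceId: string;
  actionGroup: string; // e.g. 'critical' | 'high' | 'medium' | 'low'
}

async function executeWithSuppression(
  evaluateRule: (ruleId: string) => Promise<Evaluation[]>,
  primaryRuleId: string,
  dependencies: Dependency[]
): Promise<Array<Evaluation & { suppressed: boolean }>> {
  // 1. Execute the primary rule and see whether it triggers anything.
  const primary = await evaluateRule(primaryRuleId);
  if (primary.length === 0) return [];

  // 2. Only then evaluate each dependency, in the same process with the same function.
  const triggeredByDeps = new Set<string>();
  for (const dep of dependencies) {
    for (const result of await evaluateRule(dep.ruleId)) {
      if (dep.suppressOn.includes(result.actionGroup)) {
        triggeredByDeps.add(result.instanceId);
      }
    }
  }

  // 3. Suppress primary evaluations whose instanceId matches (see the set logic above).
  return primary.map((p) => ({ ...p, suppressed: triggeredByDeps.has(p.instanceId) }));
}
```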

### 🔬 How to test

- Add the following lines to your `config/kibana.dev.yml`:
  - `server.basePath: '/kibana'`
  - `server.publicBaseUrl: 'http://localhost:5601/kibana'`
- Start with the following command: `node x-pack/scripts/data_forge.js --events-per-cycle 50 --lookback now-1d --dataset fake_stack --install-kibana-assets --kibana-url http://localhost:5601/kibana --event-template good`
- Wait until the log message says `info Waiting 60000ms`
- Create 2 SLOs (an equivalent API sketch follows these steps):
  - "Admin Console Availability" using the "Custom Query" SLI with the
`Admin Console` DataView; set the "Good query" to
`http.response.status_code < 500` and the "Total query" to
`http.response.status_code: *`, using a rolling `7d` time window
  - "MongoDB Availability" using the "Custom Query" SLI with the
`Heartbeat` DataView; set the "Good query" to `event.outcome: "success"`
and the "Total query" to `event.outcome: *`, using a rolling `7d` time
window
- You should have 2 burn rate rules that were created by default
- Open the "Admin Console Availability Burn Rate rule" and add the
"MongoDB Availability Burn Rate rule" as a dependency with the `Critical`
and `High` action groups to "Suppress on"
- Save the rule
- Stop the first `data_forge.js` command
- Start `node x-pack/scripts/data_forge.js --events-per-cycle 50 --lookback now --dataset fake_stack --install-kibana-assets --kibana-url http://localhost:5601/kibana --event-template bad`
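
If you would rather script the SLO setup, the same two SLOs can be created through the SLO HTTP API. The sketch below is hedged: the payload shape follows the public SLO create API as I understand it, and the index patterns are guesses at what the `fake_stack` dataset writes, so verify both against your environment:

```ts
// Hedged sketch: create the two SLOs via POST /api/observability/slos.
// Payload shape and index patterns are assumptions; verify before use.
const KIBANA = 'http://localhost:5601/kibana';
const AUTH = 'Basic ' + Buffer.from('elastic:changeme').toString('base64');

async function createSlo(name: string, index: string, good: string, total: string) {
  const res = await fetch(`${KIBANA}/api/observability/slos`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'kbn-xsrf': 'true', // Kibana requires this header on write requests
      Authorization: AUTH,
    },
    body: JSON.stringify({
      name,
      description: `${name} for burn rate suppression testing`,
      indicator: {
        type: 'sli.kql.custom',
        params: { index, good, total, timestampField: '@timestamp' },
      },
      timeWindow: { duration: '7d', type: 'rolling' },
      budgetingMethod: 'occurrences',
      objective: { target: 0.99 },
    }),
  });
  if (!res.ok) throw new Error(`SLO create failed: ${res.status} ${await res.text()}`);
  return res.json();
}

// Index patterns below are guesses at the fake_stack indices.
await createSlo('Admin Console Availability', 'kbn-data-forge-fake_stack.admin-console-*',
  'http.response.status_code < 500', 'http.response.status_code: *');
await createSlo('MongoDB Availability', 'kbn-data-forge-fake_stack.heartbeat-*',
  'event.outcome: "success"', 'event.outcome: *');
```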

Once the burn rate rules go `Critical`, the "MongoDB Availability Burn Rate
rule" reason message should start with `CRITICAL:...` and the "Admin
Console Availability Burn Rate rule" reason message should start with
`SUPPRESSED - CRITICAL: ...`

Fixes #173653

---------

Co-authored-by: Panagiota Mitsopoulou <giota85@gmail.com>
Co-authored-by: Dominique Clarke <doclarke71@gmail.com>
Co-authored-by: Kevin Delemme <kdelemme@gmail.com>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Dominique Clarke <dominique.clarke@elastic.co>
2024-04-16 06:25:50 -04:00
Panagiota Mitsopoulou
33ca9ece68
[SLO] serverless integration tests (#172786)
## 🍒 Summary
This PR adds basic serverless integration tests for SLO and covers 2
scenarios:
- SLO creation 
- SLO deletion 

There is another PR that adds SLO integration tests for stateful and
covers more scenarios:
- create
- delete
- update
- reset
- get/find

This PR covers only the create and delete scenarios. I want to check the
flakiness of these tests before introducing new ones. Another reason for
not covering all scenarios here is that we don't want to duplicate effort
with @dominiqueclarke's
[PR](https://github.com/elastic/kibana/pull/173236). Once the stateful
tests are reviewed and merged, we can come up with a plan for how (and
whether) to continue with serverless tests and more scenarios.

**TODO**: Create a GitHub issue to track the SLO integration tests effort.
We also need to add privilege- and space-related test cases.

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
2024-01-16 06:56:53 -07:00
Xavier Mouligneau
a35f91e3a5
[RAM] add observability feature for serverless (#168636)
## Summary

Fixes: https://github.com/elastic/kibana/issues/168034


### Checklist

- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

---------

Co-authored-by: mgiota <panagiota.mitsopoulou@elastic.co>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
2023-10-31 14:27:53 -07:00