## Summary
Resolves: https://github.com/elastic/kibana/issues/151463
Removes all references to ephemeral tasks from the task manager plugin, as
well as the related unit and E2E tests, while maintaining backwards
compatibility so the `xpack.task_manager.ephemeral_tasks` flag becomes a
no-op if set. This PR depends on the PR that removes ephemeral task support
from the alerting and actions plugins
(https://github.com/elastic/kibana/pull/197421), so it should be merged
after that PR.
Deprecates the following configuration settings:
- xpack.task_manager.ephemeral_tasks.enabled
- xpack.task_manager.ephemeral_tasks.request_capacity
Users don't have to change anything on their end if they don't wish to.
With this deprecation, if the above settings are defined, Kibana will
simply do nothing.
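For reference, a minimal sketch of how such a no-op deprecation can be declared, assuming the standard `ConfigDeprecationFactory` helpers from Kibana core (the actual registration in this PR may differ):
```
import { schema, type TypeOf } from '@kbn/config-schema';
import type { PluginConfigDescriptor } from '@kbn/core/server';

// Keep accepting the old keys so existing kibana.yml files still validate,
// but never read their values anywhere in the plugin.
const configSchema = schema.object({
  ephemeral_tasks: schema.maybe(
    schema.object({
      enabled: schema.maybe(schema.boolean()),
      request_capacity: schema.maybe(schema.number()),
    })
  ),
});

export const config: PluginConfigDescriptor<TypeOf<typeof configSchema>> = {
  schema: configSchema,
  deprecations: ({ unused }) => [
    // Log a deprecation warning if the settings are present; the values are ignored.
    unused('ephemeral_tasks.enabled', { level: 'warning' }),
    unused('ephemeral_tasks.request_capacity', { level: 'warning' }),
  ],
};
```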
### Checklist
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
## Summary
Implements a security_solution task scheduled to run once a day to
collect the following information:
1. Data stream stats
2. Indices stats
3. ILM stats
4. ILM configs
The task allows a runtime configuration to limit the number of indices
and data streams to analyze, or even to disable the feature entirely.
Once the data is gathered, the task sends it as EBT events.
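As a rough sketch of the shape of such a task (assuming the usual Task Manager registration API and the core analytics/EBT service; the task type name, `collectStats` helper, and event type below are illustrative only):
```
// registered during plugin setup; `taskManager`, `analytics` and `config`
// come from the plugin's setup contracts
taskManager.registerTaskDefinitions({
  'security_solution:data-quality-stats': {
    title: 'Security Solution data stats collector',
    timeout: '10m',
    createTaskRunner: ({ taskInstance }) => ({
      run: async () => {
        // placeholder for gathering data stream, index and ILM stats/configs,
        // honoring the configured limits (or the kill switch)
        const events = await collectStats({ maxIndices: config.maxIndices });
        for (const event of events) {
          analytics.reportEvent('data-quality-stats', event);
        }
        return { state: taskInstance.state };
      },
    }),
  },
});

// scheduled once a day
await taskManager.ensureScheduled({
  id: 'security_solution:data-quality-stats:1.0.0',
  taskType: 'security_solution:data-quality-stats',
  schedule: { interval: '24h' },
  params: {},
  state: {},
});
```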
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
## Summary
Closes https://github.com/elastic/kibana/issues/193352
Update:
A new SO field, `bump_agent_policy_revision`, on the package policy type
is used to mark package policies for update; this triggers an agent policy
revision bump.
The feature supports both legacy and new package policy SO types, and
queries policies from all spaces.
To test, add a model version change to the package policy type and save.
After Fleet setup is run, the agent policies using the package policies
should be bumped and deployed.
The same effect can be achieved by manually updating a package policy SO
and loading the Fleet UI to trigger setup:
```
'2': {
  changes: [
    {
      type: 'data_backfill',
      backfillFn: (doc) => {
        return { attributes: { ...doc.attributes, bump_agent_policy_revision: true } };
      },
    },
  ],
},
```
```
curl -sk -XPOST --user fleet_superuser:password \
  -H 'content-type:application/json' \
  -H 'x-elastic-product-origin:fleet' \
  http://localhost:9200/.kibana_ingest/_update_by_query -d '
{
  "query": {
    "match": {
      "type": "fleet-package-policies"
    }
  },
  "script": {
    "source": "ctx._source[\"fleet-package-policies\"].bump_agent_policy_revision = true",
    "lang": "painless"
  }
}'
```
```
[2024-11-20T14:40:30.064+01:00][INFO ][plugins.fleet] Found 1 package policies that need agent policy revision bump
[2024-11-20T14:40:31.933+01:00][DEBUG][plugins.fleet] Updated 1 package policies in space space1 in 1869ms, bump 1 agent policies
[2024-11-20T14:40:35.056+01:00][DEBUG][plugins.fleet] Deploying 1 policies
[2024-11-20T14:40:35.493+01:00][DEBUG][plugins.fleet] Deploying policies: 7f108cf2-4cf0-4a11-8df4-fc69d00a3484:10
```
TODO:
- the same flag has to be added on the agent policy and output types, and
the task extended to update them
- I plan to do this in another PR, so that this one doesn't become too big
- add an integration test if possible
### Scale testing
Tested with 500 agent policies split across 2 spaces, 1 integration per
policy, and the flag bumped in a new saved object model version; the bump
task took about 6s.
The deploy policies step is async and took about 30s.
```
[2024-11-20T15:53:55.628+01:00][INFO ][plugins.fleet] Found 501 package policies that need agent policy revision bump
[2024-11-20T15:53:57.881+01:00][DEBUG][plugins.fleet] Updated 250 package policies in space space1 in 2253ms, bump 250 agent policies
[2024-11-20T15:53:59.926+01:00][DEBUG][plugins.fleet] Updated 251 package policies in space default in 4298ms, bump 251 agent policies
[2024-11-20T15:54:01.186+01:00][DEBUG][plugins.fleet] Deploying 250 policies
[2024-11-20T15:54:29.989+01:00][DEBUG][plugins.fleet] Deploying policies: test-policy-space1-1:4, ...
[2024-11-20T15:54:33.538+01:00][DEBUG][plugins.fleet] Deploying policies: policy-elastic-agent-on-cloud:4, test-policy-default-1:4, ...
```
### Checklist
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
## Summary
Close https://github.com/elastic/kibana/issues/193473
Close https://github.com/elastic/kibana/issues/193474
This PR utilizes the documentation packages that are built via the tool
introduced in https://github.com/elastic/kibana/pull/193847, allowing them
to be installed in Kibana and exposing documentation retrieval as an LLM
task that AI assistants (or other consumers) can call.
Users can now decide to install the Elastic documentation from the
assistant's config screen, which will expose a new tool to the
assistant, `retrieve_documentation` (only implemented for the o11y
assistant in the current PR; it will be done for security as a follow-up).
For more information, please refer to the self-review.
## General architecture
<img width="1118" alt="Screenshot 2024-10-17 at 09 22 32"
src="https://github.com/user-attachments/assets/3df8c30a-9ccc-49ab-92ce-c204b96d6fc4">
## What this PR does
Adds two plugins:
- `productDocBase`: contains all the logic related to product
documentation installation, status, and search. This is meant to be a
"low level" component only responsible for this specific part.
- `llmTasks`: a higher-level plugin that will contain various LLM tasks
to be used by assistants and genAI consumers. The intent is not to have
a single place for all LLM tasks, but to have a default place from which
we can introduce new tasks. (FWIW, the `nlToEsql` task will
probably be moved to that plugin.)
- Add a `retrieve_documentation` tool registration for the o11y
assistant
- Add a component on the o11y assistant configuration page to install
the product doc
(wiring the feature to the o11y assistant was done mostly for testing
purposes; any additions/changes/enhancements should be done by the
owning team, either in this PR or as a follow-up)
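Purely as an illustration of the intended consumer flow (the function and parameter names below are hypothetical, not the final `llmTasks` API), a consumer could look roughly like this:
```
// inside a tool/function handler of an assistant
if (await llmTasks.retrieveDocumentationAvailable()) {
  const { documents } = await llmTasks.retrieveDocumentation({
    searchTerm: 'how to create an alerting rule',
    products: ['kibana'],
    max: 3,
    connectorId, // the GenAI connector used to summarize long documents
    request,     // the current KibanaRequest
  });
  // feed the returned documents back to the assistant as tool output
  return { documents };
}
return { documents: [] };
```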
## What is NOT included in this PR:
- Wire product base feature to the security assistant (should be done by
the owning team as a follow-up)
- installation
- utilization as tool
- FTR tests: this is somewhat blocked by the same things we need to
figure out for https://github.com/elastic/kibana-team/issues/1271
## Screenshots
### Installation from o11y assistant configuration page
<img width="1476" alt="Screenshot 2024-10-17 at 09 41 24"
src="https://github.com/user-attachments/assets/31daa585-9fb2-400a-a2d1-5917a262367a">
### Example of output
#### Without product documentation installed
<img width="739" alt="Screenshot 2024-10-10 at 09 59 41"
src="https://github.com/user-attachments/assets/993fb216-6c9a-433f-bf44-f6e383d20d9d">
#### With product documentation installed
<img width="718" alt="Screenshot 2024-10-10 at 09 55 38"
src="https://github.com/user-attachments/assets/805ea4ca-8bc9-4355-a434-0ba81f8228a9">
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Alex Szabo <alex.szabo@elastic.co>
Co-authored-by: Matthias Wilhelm <matthias.wilhelm@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Resolves https://github.com/elastic/kibana/issues/192686
## Summary
Creates a background task to search for removed task types and mark them
as unrecognized. Removes the current logic that does this during the
task claim cycle for both task claim strategies.
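Conceptually, the new task does something along these lines (a simplified sketch, not the actual implementation):
```
// task types currently registered in this Kibana node
const registeredTypes = taskDefinitions.getAllTypes();

// mark every task whose type is no longer registered as unrecognized
await esClient.updateByQuery({
  index: '.kibana_task_manager',
  refresh: true,
  query: {
    bool: {
      must: [{ term: { type: 'task' } }],
      must_not: [{ terms: { 'task.taskType': registeredTypes } }],
    },
  },
  script: {
    source: 'ctx._source.task.status = "unrecognized"',
    lang: 'painless',
  },
});
```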
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Closes https://github.com/elastic/kibana/issues/184069
**The Problem**
The LLM decides the identifier (both `_id` and `doc_id`) for knowledge
base entries. The `_id` must be globally unique in Elasticsearch, but the
LLM can easily pick the same id for different users, thereby overwriting
one user's learning with another user's.
**Solution**
The LLM should not pick the `_id`. With this PR a UUID is generated for
new entries. This means the LLM will only be able to create new KB
entries - it will not be able to update existing ones.
`doc_id` has been removed and replaced with a `title` property. The title
is simply a human-readable string; it is not used to identify KB
entries.
To retain backwards compatibility, we will display the `doc_id` if
`title` is not available.
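A minimal sketch of the new behavior (illustrative; `esClient` and `kbIndex` are placeholders): the id is generated server-side with a UUID, and `title` is stored as a plain display string.
```
import { v4 as uuidv4 } from 'uuid';

async function createKnowledgeBaseEntry(entry: { title: string; text: string }) {
  // generated server-side, so the LLM can never collide with (and overwrite)
  // another user's entry
  const id = uuidv4();
  await esClient.index({
    index: kbIndex,
    id,
    document: { title: entry.title, text: entry.text, '@timestamp': new Date().toISOString() },
  });
  return id;
}
```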
---------
Co-authored-by: Sandra G <neptunian@users.noreply.github.com>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
## Summary
The Security Solution Entity Store feature will now be available by
default. However, there will be a flag that can be switched on, if
desired, to **disable** that feature entirely.
Regardless of whether this flag is enabled or not, Security's Entity
Store is still only fully enabled through an enablement workflow. In
other words, a Security Solution customer must turn on the feature
through an onboarding workflow in order to enable its features.
Additionally, we are disabling this feature in Serverless at first, to
perform proper Serverless load/performance testing. (We do not expect it
to be significantly different than ESS/ECH, but are doing so out of an
abundance of caution).
---------
Co-authored-by: Pablo Machado <pablo.nevesmachado@elastic.co>
## Summary
Closes https://github.com/elastic/kibana/issues/189506
Testing steps:
- enable deleting unenrolled agents by adding
`xpack.fleet.enableDeleteUnenrolledAgents: true` to `kibana.dev.yml`, or
turn it on in the UI
- add some unenrolled agents with the helper script
```
cd x-pack/plugins/fleet
node scripts/create_agents/index.js --status unenrolled --count 10
info Creating 10 agents with statuses:
info unenrolled: 10
info Batch complete, created 10 agent docs, took 0, errors: false
info All batches complete. Created 10 agents in total. Goodbye!
```
- restart kibana or wait for the task to run and verify that the
unenrolled agents were deleted
```
[2024-10-08T16:14:45.152+02:00][DEBUG][plugins.fleet.fleet:delete-unenrolled-agents-task:0.0.5] [DeleteUnenrolledAgentsTask] Executed deletion of 10 unenrolled agents
[2024-10-08T16:14:45.153+02:00][INFO ][plugins.fleet.fleet:delete-unenrolled-agents-task:0.0.5] [DeleteUnenrolledAgentsTask] runTask ended: success
```
Added to UI settings:
<img width="1057" alt="image"
src="https://github.com/user-attachments/assets/2c9279f9-86a8-4630-a6cd-5aaa42e05fe7">
If the flag is preconfigured, the update is disabled in the UI with a tooltip:
<img width="1009" alt="image"
src="https://github.com/user-attachments/assets/45041020-6447-4295-995e-6848f0238f88">
The update is also prevented from the API:
<img width="2522" alt="image"
src="https://github.com/user-attachments/assets/cfbc8e21-e062-4e7f-9d08-9767fa387752">
Once the preconfiguration is removed, the UI update is allowed again.
### Checklist
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
## Summary
Resolves https://github.com/elastic/kibana/issues/188043
This PR adds a new connector that defines the integration with the Elastic
Inference Endpoint via the [Inference
APIs](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-apis.html).
The lifecycle of the Inference Endpoint is managed by the connector's
registered handlers:
- `preSaveHook` -
[creates](https://www.elastic.co/guide/en/elasticsearch/reference/current/put-inference-api.html)
a new Inference Endpoint in connector create mode (`isEdit === false`)
and performs a
[delete](https://www.elastic.co/guide/en/elasticsearch/reference/current/delete-inference-api.html)+[create](https://www.elastic.co/guide/en/elasticsearch/reference/current/put-inference-api.html)
in connector edit mode (`isEdit === true`)
- `postSaveHook` - checks whether the connector SO was created/updated and,
if not, removes the Inference Endpoint created in `preSaveHook`
- `postDeleteHook` -
[deletes](https://www.elastic.co/guide/en/elasticsearch/reference/current/delete-inference-api.html)
the Inference Endpoint if the connector was deleted.
In Kibana Stack Management Connectors, it is represented with a new
card (with a Technical preview badge):
<img width="1261" alt="Screenshot 2024-09-27 at 2 11 12 PM"
src="https://github.com/user-attachments/assets/dcbcce1f-06e7-4d08-8b77-0ba4105354f8">
To simplify the future integration with AI Assistants, the connector
consists of two main UI parts: the provider selector and the required
provider settings, which are always displayed,
<img width="862" alt="Screenshot 2024-10-07 at 7 59 09 AM"
src="https://github.com/user-attachments/assets/87bae493-c642-479e-b28f-6150354608dd">
and Additional options, which contain the optional provider settings and
the Task Type configuration:
<img width="861" alt="Screenshot 2024-10-07 at 8 00 15 AM"
src="https://github.com/user-attachments/assets/2341c034-6198-4731-8ce7-e22e6c6fb20f">
The subActions correspond to the different task types the Inference API
supports. Each task type has its own Inference Perform params.
Currently added (a usage sketch follows this list):
- completion & completionStream
- rerank
- text_embedding
- sparse_embedding
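For illustration only (the parameter shapes are assumptions, not the connector's documented contract), executing one of these subActions through the actions client might look like:
```
const response = await actionsClient.execute({
  actionId: inferenceConnectorId,
  params: {
    subAction: 'rerank',
    subActionParams: {
      query: 'which document is about task manager?',
      input: ['first candidate document', 'second candidate document'],
    },
  },
});
```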
Follow up work:
1. Collapse/expand Additional options when the connector flyout/modal
has an AI Assistant as context (passed through an extended context
implementation at the connector framework level)
2. Add support for additional params for the Completion subAction to be
able to pass functions
3. Add support for a token usage dashboard once the Inference API
includes the used token count in the response
4. Add functionality and UX for migrating from the existing
provider-specific AI connectors to the Inference connector with the
proper provider and completion task
5. Integrate the connector with the AI Assistants
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
Co-authored-by: Steph Milovic <stephanie.milovic@elastic.co>
Resolves https://github.com/elastic/kibana/issues/192568
In this PR, I'm solving the issue where the task manager health API is
unable to determine how many Kibana nodes are running. I'm doing so by
leveraging the Kibana discovery service to get a count, instead of
calculating it from an aggregation on the `.kibana_task_manager` index
that counts the unique number of `ownerId` values, which requires tasks
to be running and sufficiently distributed across the Kibana nodes to
determine the number properly.
Note: this will only work when `mget` is the task claim strategy.
## To verify
1. Set `xpack.task_manager.claim_strategy: mget` in kibana.yml
2. Start up the PR locally with Elasticsearch and Kibana running
3. Navigate to the `/api/task_manager/_health` route and confirm
`observed_kibana_instances` is `1`
4. Apply the following code and restart Kibana
```
diff --git a/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts b/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts
index 090847032bf..69dfb6d1b36 100644
--- a/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts
+++ b/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts
@@ -59,6 +59,7 @@ export class KibanaDiscoveryService {
const lastSeen = lastSeenDate.toISOString();
try {
await this.upsertCurrentNode({ id: this.currentNode, lastSeen });
+ await this.upsertCurrentNode({ id: `${this.currentNode}-2`, lastSeen });
if (!this.started) {
this.logger.info('Kibana Discovery Service has been started');
this.started = true;
```
5. Navigate to the `/api/task_manager/_health` route and confirm
`observed_kibana_instances` is `2`
---------
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
## Summary
Follow-up to #190690
Most of the API integration tests did not match the path pattern set in
the original PR (thanks @pheyos for catching it) and were not updated.
This PR updates `.eslintrc.js` with explicit patterns to lint
api_integration tests. Hopefully it is the final change, but I rely on
code owners to double-check it.
Most of the changes are trivial adjustments:
- duplicated before/after hooks `mocha/no-sibling-hooks`
- duplicated test titles `mocha/no-identical-title`
- async function in describe() `mocha/no-async-describe`
---------
Co-authored-by: Ash <1849116+ashokaditya@users.noreply.github.com>
## Summary
This PR enforces ESLint rules in FTR tests, in particular:
- `no-floating-promises` rule to catch unawaited Promises in
tests/services/page objects
_Why is it important?_
- Keep correct test execution order: cleanup code may run before the
async operation is completed, leading to unexpected behavior in
subsequent tests
- Accurate test results: If a test completes before an async operation
(e.g., API request) has finished, Mocha might report the test as passed
or failed based on incomplete context.
```
198:11 error Promises must be awaited, end with a call to .catch, end with a call to .then
with a rejection handler or be explicitly marked as ignored with the `void` operator
@typescript-eslint/no-floating-promises
```
<img width="716" alt="Screenshot 2024-08-20 at 14 04 43"
src="https://github.com/user-attachments/assets/9afffe4c-4b51-4790-964c-c44a76baed1e">
- recommended rules from
[eslint-mocha-plugin](https://www.npmjs.com/package/eslint-plugin-mocha)
including:
-
[no-async-describe](https://github.com/lo1tuma/eslint-plugin-mocha/blob/main/docs/rules/no-async-describe.md)
-
[no-identical-title.md](https://github.com/lo1tuma/eslint-plugin-mocha/blob/main/docs/rules/no-identical-title.md)
-
[no-sibling-hooks.md](https://github.com/lo1tuma/eslint-plugin-mocha/blob/main/docs/rules/no-sibling-hooks.md)
and others
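As a small illustration of the typical `no-floating-promises` fix (hypothetical FTR test code):
```
it('navigates to Discover', async () => {
  // Before: a floating promise, the test may finish before navigation completes
  // PageObjects.common.navigateToApp('discover');

  // After: awaited, so the assertion runs against the fully loaded page
  await PageObjects.common.navigateToApp('discover');
  expect(await browser.getCurrentUrl()).to.contain('/app/discover');
});
```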
Note for reviewers: some tests were skipped due to failures after the
missing `await` was added. Most likely these are "false positive" cases
where the test finished before the async operation actually completed.
Please work on fixing and re-enabling them.
---------
Co-authored-by: Tiago Costa <tiago.costa@elastic.co>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
In this PR, I'm renaming the task manager claim strategies as we prepare
to roll out the `mget` task claiming strategy as the default.
Rename:
- `unsafe_mget` -> `mget`
- `default` -> `update_by_query`
---------
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
## Summary
Redoing the resource based task claim PR:
https://github.com/elastic/kibana/pull/187999 and followup PRs
https://github.com/elastic/kibana/pull/189220 and
https://github.com/elastic/kibana/pull/189117. Please see the
descriptions of those PRs for more details.
This was originally reverted because unregistered task types in serverless
caused the task manager health aggregation to fail. This PR includes an
additional commit to exclude unregistered task types from the health
report:
58eb2b1db7.
To verify this, make sure you're using the `default` claim strategy and
start up Kibana so that the default set of tasks gets created. Then
either disable a bunch of plugins via config:
```
# remove security and o11y
enterpriseSearch.enabled: false
xpack.apm.enabled: false
xpack.cloudSecurityPosture.enabled: false
xpack.fleet.enabled: false
xpack.infra.enabled: false
xpack.observability.enabled: false
xpack.observabilityAIAssistant.enabled: false
xpack.observabilityLogsExplorer.enabled: false
xpack.search.notebooks.enabled: false
xpack.securitySolution.enabled: false
xpack.uptime.enabled: false
```
or comment out the task registration of a task that was previously
scheduled (I'm using the observability AI assistant)
```
--- a/x-pack/plugins/observability_solution/observability_ai_assistant/server/service/index.ts
+++ b/x-pack/plugins/observability_solution/observability_ai_assistant/server/service/index.ts
@@ -89,24 +89,24 @@ export class ObservabilityAIAssistantService {
this.allowInit();
- taskManager.registerTaskDefinitions({
- [INDEX_QUEUED_DOCUMENTS_TASK_TYPE]: {
- title: 'Index queued KB articles',
- description:
- 'Indexes previously registered entries into the knowledge base when it is ready',
- timeout: '30m',
- maxAttempts: 2,
- createTaskRunner: (context) => {
- return {
- run: async () => {
- if (this.kbService) {
- await this.kbService.processQueue();
- }
- },
- };
- },
- },
- });
+ // taskManager.registerTaskDefinitions({
+ // [INDEX_QUEUED_DOCUMENTS_TASK_TYPE]: {
+ // title: 'Index queued KB articles',
+ // description:
+ // 'Indexes previously registered entries into the knowledge base when it is ready',
+ // timeout: '30m',
+ // maxAttempts: 2,
+ // createTaskRunner: (context) => {
+ // return {
+ // run: async () => {
+ // if (this.kbService) {
+ // await this.kbService.processQueue();
+ // }
+ // },
+ // };
+ // },
+ // },
+ // });
}
```
and restart Kibana. You should still be able to access the TM health
report with the workload field, and if you update the background health
logging so it always logs, and more frequently, you should see the
logging succeed with no errors.
Below, I've made changes to always log the background health at a 15
second interval:
```
--- a/x-pack/plugins/task_manager/server/plugin.ts
+++ b/x-pack/plugins/task_manager/server/plugin.ts
@@ -236,6 +236,7 @@ export class TaskManagerPlugin
     if (this.isNodeBackgroundTasksOnly()) {
       setupIntervalLogging(monitoredHealth$, this.logger, LogHealthForBackgroundTasksOnlyMinutes);
     }
+    setupIntervalLogging(monitoredHealth$, this.logger, LogHealthForBackgroundTasksOnlyMinutes);
```
and reduce the logging interval:
```
--- a/x-pack/plugins/task_manager/server/lib/log_health_metrics.ts
+++ b/x-pack/plugins/task_manager/server/lib/log_health_metrics.ts
@@ -35,7 +35,8 @@ export function setupIntervalLogging(
     monitoredHealth = m;
   });
-  setInterval(onInterval, 1000 * 60 * minutes);
+  // setInterval(onInterval, 1000 * 60 * minutes);
+  setInterval(onInterval, 1000 * 15);
   function onInterval() {
```
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
# Backport
This will backport the following commits from `deploy-fix@1722233551` to
`main`:
- [Revert TM resource based task scheduling issues
(#189529)](https://github.com/elastic/kibana/pull/189529)
### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)
Resolves https://github.com/elastic/kibana/issues/185043
## Summary
### Task types can define a `cost` associated with running it
- Optional definition that defaults to `Normal` cost
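A sketch of what declaring a cost could look like (the `TaskCost` enum name and the task type below are assumptions of this sketch):
```
taskManager.registerTaskDefinitions({
  'reporting:generate-pdf': {
    title: 'Generate PDF report',
    timeout: '10m',
    cost: TaskCost.ExtraLarge, // defaults to Normal when omitted
    createTaskRunner: (context) => ({
      run: async () => {
        // ...the expensive work...
      },
    }),
  },
});
```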
### New `xpack.task_manager.capacity` setting
- Previous `xpack.task_manager.max_workers` setting is deprecated,
changed to optional, and a warning will be logged if used
- New optional `xpack.task_manager.capacity` setting is added. This
represents the number of normal cost tasks that can be run at one time.
- When `xpack.task_manager.max_workers` is defined and
`xpack.task_manager.capacity` is not defined, a deprecation warning is
logged and the value for max workers will be used as the capacity value.
- When `xpack.task_manager.capacity` is defined and
`xpack.task_manager.max_workers` is not defined, the capacity value will
be used. For the `default` claiming strategy, this capacity value will
be used as the `max_workers` value
- When both values are set, a warning will be logged and the value for
`xpack.task_manager.capacity` will be used
- When neither value is set, the `DEFAULT_CAPACITY` value will be used.
### Updates to `TaskPool` class
- Moves the logic to determine used and available capacity so that we
can switch between capacity calculators based on claim strategy. For the
`default` claim strategy, the capacity will be in units of workers. For
the `mget` claim strategy, the capacity will be in units of task cost.
### Updates to `mget` task claimer
- Updated `taskStore.fetch` call to take a new parameter that will
return a slimmer task document that excludes that task state and task
params. This will improve the I/O efficiency of returning up to 400 task
docs in one query
- Applies capacity constraint to the candidate tasks.
- Bulk gets the full task documents for the tasks we have capacity for
in order to update them to `claiming` status. Uses the
`SavedObjectsClient.bulkGet` which uses an `mget` under the hood.
### Updates the monitoring stats
- Emitting capacity config value and also capacity as translated into
workers and cost.
- Added total cost of running and overdue tasks to the health report
## Tasks for followup issues
- Update mget functional tests to include tasks with different costs. -
https://github.com/elastic/kibana/issues/189111
- Update cost of indicator match rule to be Extra Large -
https://github.com/elastic/kibana/issues/189112
- Set `xpack.task_manager.capacity` on ECH based on the node size -
https://github.com/elastic/kibana/pull/189117
---------
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Resolves: #187696
This PR introduces Kibana Discovery Service for the TaskManager plugin.
- Creates a new SO type in the TaskManagerIndex.
- The SO has 2 fields: `id` (holds the kibana node id) and `last_seen`
(timestamp of the last update applied by the node)
- Discovery Service in TM creates an SO on start and updates its
last_seen field every 10s
- The service also deletes the SOs that haven't been updated in the last
5m, by checking the index every 1m.
- TM deletes its SO on plugin stop.
- Discovery Service provides an API (`getActiveKibanaNodes`) to get the
active Kibana nodes (those whose `last_seen` field has been updated in the
last 30s); a usage sketch follows below.
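For example, a consumer could use it roughly like this (the exact return shape is an assumption of this sketch):
```
const activeNodes = await kibanaDiscoveryService.getActiveKibanaNodes();
// each entry is a `background-task-node` SO whose last_seen falls within the last 30s
logger.debug(`Observed ${activeNodes.length} active Kibana node(s)`);
```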
## To verify:
Run your Kibana locally and check the `.kibana_task_manager` index with
the query below; there should be an SO, and its `last_seen` field should
be updated every 10s.
```
{
"query": {
"term": {
"type": "background-task-node"
}
},
"size" : 10
}
```
---
The PR has been deployed to cloud as well, you can check the SOs for
multiple Kibana instances there.
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Resolves https://github.com/elastic/kibana/issues/181145
## Summary
Adds an optional flag `shouldDeleteTask` to a successful task run
result. If this flag is set to true, task manager will remove the task
at the end of the processing cycle. This allows tasks to gracefully
inform us that they need to be deleted without throwing an unrecoverable
error (the current way that tasks tell us they want to be deleted).
Audited existing usages of `throwUnrecoverableError`. Other than usages
within the alerting and actions task runner, which are thrown for valid
error states, all other usages were by tasks that were considered
outdated and should be deleted. Updated all those usages to return the
`shouldDeleteTask` run result.
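A minimal sketch of a task runner using the new flag (the `isTaskObsolete` check is a placeholder):
```
createTaskRunner: ({ taskInstance }) => ({
  run: async () => {
    if (await isTaskObsolete(taskInstance)) {
      // ask task manager to remove this task at the end of the processing
      // cycle instead of throwing an unrecoverable error
      return { state: taskInstance.state, shouldDeleteTask: true };
    }
    // ...normal work...
    return { state: taskInstance.state };
  },
}),
```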
---------
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
## Summary
Reverted
557633456c
from deploy@1717401777 as part of an emergency release. This PR follows
the emergency release guidelines:
`In a separate PR, the fix should be "frontported" to main by manually
cherry-picking the commit.`
## Summary
This updates the task manager metrics aggregator to collect and emit
metrics when a `reset$` event is observed.
The `/api/task_manager/metrics` route subscribes to and saves the latest
task manager metrics and immediately returns the latest metrics when the
API is accessed. At a minimum, metrics are collected and emitted at
every polling interval (every 3 seconds). Usually emission is more
frequent than this because we emit metrics events every time a task run
completes.
Under normal circumstances, when the agent is configured to collect from
the API once every 10 seconds, this is what happens
```
00:00:00 metrics$.subscribe(({errors: 3}) => lastMetrics = metrics) - metrics emitted and saved
00:00:03 metrics$.subscribe(({errors: 4}) => lastMetrics = metrics) - metrics emitted and saved
00:00:05 API called with reset=true, return lastMetrics, metrics reset to 0
00:00:06 metrics$.subscribe(({errors: 1}) => lastMetrics = metrics) - metrics emitted and saved
00:00:09 metrics$.subscribe(({errors: 2}) => lastMetrics = metrics) - metrics emitted and saved
00:00:10 API called with reset=true, return lastMetrics, metrics reset to 0
```
We can see that the metrics are reset and then by the time the next
collection interval comes around, fresh metrics have been emitted.
We currently have an issue where the API is collected against twice in
quick succession. Most of the time, this leads to duplicate metrics
being collected.
```
00:00:00:00 metrics$.subscribe(({errors: 3}) => lastMetrics = metrics) - metrics emitted and saved
00:00:03:00 metrics$.subscribe(({errors: 4}) => lastMetrics = metrics) - metrics emitted and saved
00:00:05:00 API called with reset=true, return lastMetrics, metrics reset to 0
00:00:05:01 API called with reset=true, return lastMetrics, metrics reset to 0 - this is a duplicate
00:00:06:00 metrics$.subscribe(({errors: 1}) => lastMetrics = metrics) - metrics emitted and saved
00:00:09:00 metrics$.subscribe(({errors: 2}) => lastMetrics = metrics) - metrics emitted and saved
```
However, sometimes this leads to a race condition that results in
different metrics being collected.
```
00:00:00:00 metrics$.subscribe(({errors: 3}) => lastMetrics = metrics) - metrics emitted and saved
00:00:03:00 metrics$.subscribe(({errors: 4}) => lastMetrics = metrics) - metrics emitted and saved
00:00:05:00 API called with reset=true, return lastMetrics, metrics reset to 0
00:00:05:01 metrics$.subscribe(({errors: 1}) => lastMetrics = metrics) - metrics emitted and saved
00:00:05:02 API called with reset=true, return lastMetrics, metrics reset to 0
00:00:06:00 metrics$.subscribe(({errors: 1}) => lastMetrics = metrics) - metrics emitted and saved
00:00:09:00 metrics$.subscribe(({errors: 2}) => lastMetrics = metrics) - metrics emitted and saved
```
With this PR, on every reset we'll re-emit the metrics, so even in
the face of the duplicate collection, we won't be emitting duplicate
metrics. After this is deployed, we should not need to exclude
`kubernetes.container.name :"elastic-internal-init-config"` from the
dashboards.
```
00:00:00:00 metrics$.subscribe(({errors: 3}) => lastMetrics = metrics) - metrics emitted and saved
00:00:03:00 metrics$.subscribe(({errors: 4}) => lastMetrics = metrics) - metrics emitted and saved
00:00:05:00 API called with reset=true, return lastMetrics, metrics reset to 0
00:00:05:00 metrics$.subscribe(({errors: 0}) => lastMetrics = metrics) - metrics emitted and saved
00:00:05:01 API called with reset=true, return lastMetrics, metrics reset to 0
00:00:05:01 metrics$.subscribe(({errors: 0}) => lastMetrics = metrics) - metrics emitted and saved
00:00:06:00 metrics$.subscribe(({errors: 1}) => lastMetrics = metrics) - metrics emitted and saved
00:00:09:00 metrics$.subscribe(({errors: 2}) => lastMetrics = metrics) - metrics emitted and saved
```
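A simplified RxJS sketch of the re-emit-on-reset behavior described above (not the actual aggregator code):
```
import { merge, map, scan, Subject } from 'rxjs';

const taskErrors$ = new Subject<number>(); // emits 1 per failed task run
const reset$ = new Subject<void>();        // fired when the API is called with reset=true

let lastMetrics: { errors: number } | undefined;

const metrics$ = merge(
  taskErrors$.pipe(map((count) => ({ reset: false, count }))),
  // re-emit immediately on reset instead of waiting for the next task event,
  // so two back-to-back collections never see stale counters
  reset$.pipe(map(() => ({ reset: true, count: 0 })))
).pipe(
  scan((errors, event) => (event.reset ? 0 : errors + event.count), 0),
  map((errors) => ({ errors }))
);

metrics$.subscribe((metrics) => (lastMetrics = metrics));
```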
## Summary
- Enables the `responseActionsSentinelOneV2Enabled` feature flag for
`main`
- This same FF was enabled already in 8.14 via:
https://github.com/elastic/kibana/pull/182384
- Changes the background task for completing response actions to have an
initial timeout of `5m` (instead of `20m`)
## Summary
Creates a system connector that can call the observability AI assistant
to execute actions on behalf of the user. The connector is tagged as tech
preview.
The connector can be triggered when an alert fires. The connector can be
configured with an initial message to the assistant, which generates an
answer and triggers potential actions on the assistant side. The current
experimental scenario is to ask the assistant to generate a report of
the alert that fired (by initially providing some context in the first
message), recalling any information/potential resolutions of previous
occurrences stored in the knowledge base, and also including other active
alerts that may be related. One last step that can be asked of the
assistant is to trigger an action; currently only sending the report (or
any other message) to a preconfigured Slack webhook is supported.
## Testing
_Note: when asked to send a message to another connector (in our case
slack), we'll try to include a link to the generated conversation. It is
only possible to generate this link if
[server.publicBaseUrl](https://www.elastic.co/guide/en/kibana/current/settings.html#server-publicBaseUrl)
is correctly set in kibana settings._
- Create a slack webhook connector
- Get slack webhook. I can share one and invite you to the workspace, or
if you want to create one:
- create personal workspace at https://slack.com/signin#workspaces
- create an app for that workspace at https://api.slack.com/apps
- under Features > OAuth & Permissions > Scopes > Bot Token Scopes, add
`incoming-webhook` permission
- install the app
- webhook url is available under Features > Incoming Webhooks
- Create a rule that can be triggered with available documents and
attach observability AI assistant connector. (I use `Error Count
Threshold` and generate errors via `node scripts/synthtrace
many_errors.ts --live`)
- configure the connector with one genai connector and a message with
instructions. Example:
```
High error count alert has triggered. Execute the following steps:
- create a graph of the error count for the service impacted by the alert for the last 24h
- to help troubleshoot recall past occurrences of this alarm, also any other active alerts. Generate a report with all the found informations and send it to slack connector as a single message. Also include the link to this conversation in the report
```
- Track alert status and verify connector was executed. You should get a
slack notification sent by the assistant, and a new conversation will be
stored
TODO
- unit/integration tests - see
https://github.com/elastic/kibana/pull/168369 for reference
implementation
- documentation
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Dario Gieselaar <dario.gieselaar@elastic.co>
Towards: #176585
This PR removes the task skipping logic from TaskManager; PRs for
Alerting and Actions will follow.
## To verify
Rules and actions should still work as expected.
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Resolves https://github.com/elastic/kibana/issues/174353
## Summary
Adds the ability for a task instance to specify a timeout override that
will be used in place of the task type timeout when running an ad-hoc task.
In the future we may consider allowing timeout overrides for recurring
tasks, but this PR limits usage to ad-hoc task runs only.
This timeout override is planned for use by backfill rule execution
tasks so the only usages in this PR are in the functional tests.
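As an illustration (the task type and the exact field name on the scheduling API are assumptions of this sketch), an ad-hoc task could be scheduled with its own timeout like so:
```
await taskManager.schedule({
  taskType: 'alerting:backfill', // hypothetical ad-hoc task type
  params: { ruleId: 'abc', start: '2024-01-01T00:00:00.000Z', end: '2024-01-02T00:00:00.000Z' },
  state: {},
  timeoutOverride: '2h', // used instead of the task type's timeout for this run
});
```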
Resolves https://github.com/elastic/kibana/issues/174352
## Summary
Adds an optional `priority` definition to task types, which defaults to
`Normal` priority. Updates the task claiming update-by-query to include
a new scripted sort that sorts by priority in descending order so that
the highest priority tasks are claimed first (a simplified sketch of the
sort is shown below).
This priority field is planned for use by backfill rule execution tasks,
so the only usages in this PR are in the functional tests.
Also included an integration test that will ping the team if a task type
explicitly sets a priority in the task definition.
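Conceptually, the scripted sort added to the claiming query looks something like this (simplified; the real query combines this with the existing sort criteria, and the priority values shown are assumptions):
```
const sort = [
  {
    _script: {
      type: 'number',
      order: 'desc',
      script: {
        lang: 'painless',
        // task types that declare a priority sort by it; everything else
        // falls back to the Normal priority value
        source: `
          String taskType = doc['task.taskType'].value;
          if (params.priority_map.containsKey(taskType)) {
            return params.priority_map[taskType];
          }
          return params.default_priority;
        `,
        params: {
          priority_map: { 'alerting:backfill': 1 }, // example of a low-priority task type
          default_priority: 50,                     // Normal
        },
      },
    },
  },
];
```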
---------
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>