mirror of
https://github.com/elastic/kibana.git
synced 2025-06-27 10:40:07 -04:00
Implements a huggingface dataset loader for RAG evals - see [x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/README.md](https://github.com/dgieselaar/kibana/blob/hf-dataset-loader/x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/README.md).

Additionally, a `@kbn/cache-cli` tool was added that allows tooling authors to cache to disk (possibly remote storage later).

Used o3 for finding datasets on HuggingFace and doing an initial pass on a line-by-line dataset processor ([see conversation](https://chatgpt.com/share/6853e49a-e870-8000-9c65-f7a5a3a72af0)).

Libraries added:
- `cache-manager`, `cache-manager-fs-hash`, `keyv`, `@types/cache-manager-fs-hash`: caching libraries and plugins. Could not find any existing caching libraries in the repo.
- `@huggingface/hub`: API client for HF.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2.3 KiB
@kbn/cache-cli
Centralised caching helpers for scripts and CLIs in the Kibana repo.
The goal is to make it easy for engineers to cache computationally expensive or I/O-heavy operations on disk or, in the future, possibly in remote storage.
Quick start
import { fromCache, createLocalDirDiskCacheStore } from '@kbn/cache-cli';
import { createCache } from 'cache-manager';
const DOC_CACHE = createCache({
  stores: [createLocalDirDiskCacheStore({ dir: 'my_docs', ttl: 60 * 60 /* 1h */ })],
});
const docs = await fromCache('docs', DOC_CACHE, async () => fetchDocs());
fromCache(key, cache, cb, validator?) semantics:
- Tries cache.get(key) (skipped when process.env.DISABLE_KBN_CACHE is truthy).
- Runs the optional validator(cached); return false to force a refresh.
- Calls cb() if the entry is missing or invalid.
- Persists the fresh value via cache.set(key, value) and returns it.
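The steps above can be read as the following minimal sketch. It is not the package's actual source: `fromCacheSketch`, `MinimalCache`, `createMemoryCache`, and the `disabled` parameter are hypothetical names, and an in-memory `Map` stands in for a real cache-manager instance. The real helper is described as reading process.env.DISABLE_KBN_CACHE; here the flag is passed in explicitly so the example stays self-contained.

```typescript
// Hypothetical minimal cache shape, assumed for illustration only.
interface MinimalCache {
  get(key: string): Promise<unknown>;
  set(key: string, value: unknown): Promise<unknown>;
}

// In-memory stand-in for a disk-backed store.
function createMemoryCache(): MinimalCache {
  const entries = new Map<string, unknown>();
  return {
    async get(key: string) {
      return entries.get(key);
    },
    async set(key: string, value: unknown) {
      entries.set(key, value);
      return value;
    },
  };
}

async function fromCacheSketch<T>(
  key: string,
  cache: MinimalCache,
  cb: () => Promise<T>,
  validator?: (cached: T) => boolean,
  disabled = false // stands in for a truthy DISABLE_KBN_CACHE
): Promise<T> {
  if (!disabled) {
    // Step 1: try the cache.
    const cached = (await cache.get(key)) as T | undefined;
    // Step 2: a validator returning false forces a refresh.
    if (cached !== undefined && (!validator || validator(cached))) {
      return cached;
    }
  }
  // Step 3: cache miss, invalid entry, or cache disabled.
  const fresh = await cb();
  // Step 4: persist the fresh value and return it.
  await cache.set(key, fresh);
  return fresh;
}
```

In this sketch a disabled cache skips only the read; the fresh value is still written, matching the order of the steps listed above.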
Available cache stores
@kbn/cache-cli wraps cache-manager, so any Keyv-compatible store works. The helpers below ship out of the box:
Helper | Backing store | Typical use-case |
---|---|---|
createLocalDirDiskCacheStore({ dir, ttl? }) | cache-manager-fs-hash on <REPO_ROOT>/data/{dir} | Persist in ./data with an unknown ttl |
createTmpDirDiskCacheStore({ dir, ttl? }) | cache-manager-fs-hash on <OS_TMP_DIR>/{dir} | Persist in os tmp dir which might be cleared over restarts |
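To make the two locations in the table concrete, here is a small sketch of how such directories could be resolved. `resolveCacheDirSketch` and `StoreKind` are hypothetical and not part of @kbn/cache-cli; the sketch only assumes that <OS_TMP_DIR> corresponds to Node's `os.tmpdir()` and that the repo root is known to the caller.

```typescript
import * as os from 'os';
import * as path from 'path';

type StoreKind = 'local' | 'tmp';

// Hypothetical helper mirroring the table rows; the real helpers may resolve paths differently.
function resolveCacheDirSketch(kind: StoreKind, dir: string, repoRoot: string): string {
  return kind === 'local'
    ? path.join(repoRoot, 'data', dir) // <REPO_ROOT>/data/{dir}
    : path.join(os.tmpdir(), dir); // <OS_TMP_DIR>/{dir}
}
```

The practical difference is durability: a directory under the repo's data folder survives until it is deleted, while the OS temp directory may be wiped between restarts.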
Cache invalidation strategies
- Manual bypass – set DISABLE_KBN_CACHE=true to force fresh data (useful in CI workflows).
- Time-to-live (TTL) – pass ttl when creating a store to let the backend expire entries automatically.
- Programmatic validation – supply the optional cacheValidator callback (the fourth argument) to fromCache(); it receives the cached value and should return true when the value is still valid.
- Clear on disk – delete the relevant directory under data/ if you need a hard reset.
Choose whichever fits your script. They can be combined (e.g. a TTL plus a validator).
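As an illustration of combining strategies, a validator can catch staleness that a TTL cannot, such as a change in the shape of the cached payload. The sketch below is an assumption-laden example: `CachedDocs`, its `schemaVersion` field, `CURRENT_SCHEMA_VERSION`, and `isCurrentSchema` are all hypothetical names, not part of the package.

```typescript
// Hypothetical cached payload; `schemaVersion` is an illustrative field.
interface CachedDocs {
  schemaVersion: number;
  docs: string[];
}

// Bump this (hypothetical) constant whenever the payload shape changes.
const CURRENT_SCHEMA_VERSION = 2;

// Returns true while the cached payload is usable, false to force a refresh.
function isCurrentSchema(cached: CachedDocs): boolean {
  return cached.schemaVersion === CURRENT_SCHEMA_VERSION && Array.isArray(cached.docs);
}
```

Assuming a callback that returns a `CachedDocs` value, this would be passed as the fourth argument, e.g. fromCache('docs', DOC_CACHE, fetchDocs, isCurrentSchema), while the store's ttl continues to expire old entries on its own schedule.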