Loki's new TSDB Index
By Owen Diehl
TSDB Birds Eye View
Growing pains are common and databases are no exception. A year and a half ago, the Loki team started talking about how to approach order of magnitude improvements in cardinality, query throughput, and reliability. As we put the finishing touches on our new index layer, let’s take a look at how we’re trying to stay ahead of the curve.
Loki’s new index is built atop a modified version of TSDB. TSDB is a building block of Prometheus and is highly optimized for routing a set of label matchers ({job="api", cluster="us-central", environment!="prod"}) to the corresponding data.
Architecture
TSDB Heads and Write Ahead Logs
Let’s look at how these new pieces of Loki are architected. First, how we build TSDB indices in Loki. TSDBs are highly performant databases which allow us to query for streams/chunks by their labels. However, they’re immutable and must be built before they can be queried. This doesn’t match our ingestion model where logs can be pushed anytime and chunks flushed as soon as they’re ready.
To deal with this problem, we use a mutable TSDB “HEAD” which can be appended to incrementally and queried immediately. It’s not as efficient, but only needs to serve recent data that doesn’t exist in a TSDB yet. We also use a write-ahead-log (WAL) to capture chunk flushes on disk. This allows us to recover data after a crash by replaying the WAL.
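To make the flow concrete, here is a minimal sketch of that pattern; the types and method names are illustrative stand-ins, not Loki’s actual code:
// Illustrative stand-ins for the head + WAL pattern; not Loki's real types.
type Record struct {
	Labels   string // serialized label set
	ChunkRef uint64 // reference to a flushed chunk
}

// Head is the mutable in-memory index: appendable and queryable immediately.
type Head struct{ records []Record }

func (h *Head) Append(r Record) { h.records = append(h.records, r) }

// WAL is the on-disk write-ahead log (kept in memory here for brevity).
type WAL struct{ entries []Record }

func (w *WAL) Log(r Record) error { w.entries = append(w.entries, r); return nil }

func (w *WAL) Replay(fn func(Record)) {
	for _, e := range w.entries {
		fn(e)
	}
}

// Append logs the record to the WAL first (so a crash can't lose it), then
// adds it to the head so queries see it right away.
func Append(head *Head, wal *WAL, r Record) error {
	if err := wal.Log(r); err != nil {
		return err
	}
	head.Append(r)
	return nil
}

// Recover rebuilds the head after a restart by replaying the WAL.
func Recover(head *Head, wal *WAL) {
	wal.Replay(head.Append)
}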
TSDB Building, Rotation
A TSDB Manager is employed to periodically (every 15m) build TSDBs from the accumulated WALs. When a build succeeds, it “rotates” out the old TSDB Head and WAL, deleting them to free up memory and disk and replacing them with new, empty versions. This prevents unbounded growth of the less efficient pieces and replaces them with a freshly built TSDB.
The new TSDBs are performant, multitenant indices ready for immediate use. They’re shipped to remote storage for later use by index-gateways or queriers, but temporarily kept around locally so ingesters can serve queries in the meantime.
Tenancy and Compaction
The TSDBs built on the ingesters are multi-tenant. They’re built frequently (every 15m per ingester), and it would be infeasible to build per-tenant indices when each ingester can hold thousands of tenants. We use the compactor to turn many short-term, multi-tenant indices into longer-term, single-tenant ones. Because this is a regular occurrence, index-gateways and queriers must be able to query both the pre-compaction multi-tenant indices and the post-compaction single-tenant ones.
The compactor is technically a non-essential efficiency gain, meaning periods of downtime can be tolerated. However, it’s an important benefit. Let’s look at the tradeoff in a pre vs post-compaction environment for a 24h query in a cluster with 4000 tenants and 100 ingesters:
24h of 15m TSDB build intervals * 100 ingesters
= (24 * 4) * 100
= 9600 index files with 4000 tenants' data each
vs
1 of 4000 per-tenant indices covering 1d each
In the post-compacted example, we only need to query a single index file with the tenant we care about rather than 9600 index files containing data from all 4000 tenants!
Note: This doesn’t even include the benefits of deduplication, as compacted indices remove the multiple references to the same chunks created by Loki’s replication factor.
Query Planning
The next piece of TSDB we’ll look at is query planning. Much of Loki’s existing performance takes advantage of many stages of query planning, most notably
- Splitting: Querying sub-ranges of time in parallel (such as 1h intervals) and merging the results.
- Sharding: A set of optimizations where queries are rewritten into more parallelizable forms, most usually by querying non-overlapping subsets (shards) of the data independently and merging them. This is another “dimension” which we can parallelize, combining with the prior time-splitting to create the following:
Time-splitting is configurable per tenant, meaning a smaller tenant may use 1h splits while a larger one may use 30m splits to parallelize more over the same period. Sharding has historically been constant across a period config in a cluster. This makes finding optimal parallelism difficult: the relationship between time splits, cluster shard factors, and data throughput must be continually re-evaluated per tenant to find the best configuration. Even then, we’ll still over-parallelize small queries from large tenants and under-parallelize large queries from small tenants! This isn’t a good system - it’s too inflexible and requires continual tuning.
Loki’s new index supports index sampling and dynamic sharding.
Index Sampling
Index sampling means we can know how much data exists before we query it. Specifically, we can query data topology from the index alone, based on the new chunk statistics embedded in the index itself. There’s also a new API endpoint for building tools on top of this!
This enables the following:
{job="foo", env!="dev"} =>
{
"streams": 100,
"chunks": 1000,
"entries": 5000,
"bytes": 100000,
}
The query frontend component will now detect how much data each query needs to touch and use it to find an optimal parallelism factor for each request. This means faster, more flexible queries which don’t over- or under-consume querier parallelism.
Dynamic Sharding
Our modified TSDB also supports dynamic sharding. The old index hardcoded a shard factor into its internal structure, meaning we had to either avoid sharding requests or shard them at the same factor the index used. TSDB allows us to shard at any power of two, so we can shard down to the closest value that gives us an optimal bytes/query.
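As a rough illustration, here’s how a planner might turn the index-stats bytes into a power-of-two shard factor. The names (planShards, targetBytesPerShard) and the exact rounding are hypothetical, not Loki’s configuration or code:
import "math"

// planShards is a hypothetical sketch: given the total bytes a query will
// touch (from the index stats) and a desired bytes-per-subquery target,
// return the nearest power-of-two shard factor.
func planShards(statsBytes, targetBytesPerShard uint64) int {
	if statsBytes <= targetBytesPerShard {
		return 1 // small query: no sharding needed
	}
	// Ideal (fractional) number of shards.
	ideal := float64(statsBytes) / float64(targetBytesPerShard)
	// Round to the nearest power of two, since TSDB shards by bit prefix.
	power := math.Round(math.Log2(ideal))
	return 1 << int(power)
}
For example, a 100 MB query with a 10 MB per-shard target would round to 8 shards, while a query smaller than the target wouldn’t be sharded at all.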
Scheduling
There are a number of benefits to scheduling this way. It unlocks query throughput much higher than previously achievable. Perhaps more importantly, it improves work distribution. If we shard a request 16x regardless of the underlying data size, small queries will likely occupy querier workers they don’t need, which is wasteful. On the other hand, large queries may not be parallelized enough. This is a big problem: under-parallelized queries can bottleneck or even OOM queriers by sending them too much work. To account for this, we must overprovision querier resources so they can survive the occasional CPU or memory burst.
Smaller, more consistent subqueries mean lower TCO and better SLOs:
- TCO
  - Less overprovisioning = run leaner with less wasted resources
- SLOs
  - Fewer “queries of death”. Consistent workloads mean we’re less likely to OOM a querier, then OOM its neighbor when the query is rescheduled, and so on.
TSDB Structure
This section can be safely skipped unless you’re curious about how TSDB works and what we’ve changed from Prometheus’ TSDB.
At its simplest, the TSDB index is a binary format which stores a set of series, their associated chunks, and an inverted index (skipped in this section as we haven’t changed this part).
Series
The series table in TSDB stores a set of series. Each series is referenced by its byte offset in this table, which serves as its ID. Prometheus’ TSDB sorts them lexicographically by their label sets, but we sort them by the hash (fingerprint) of their label set, for reasons explained in the sharding section.
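As a small sketch of the difference (fnv is used here purely for illustration; it isn’t necessarily the hash Loki uses):
import (
	"hash/fnv"
	"sort"
)

type series struct {
	labels      string // serialized label set, e.g. {env="prod", job="api"}
	fingerprint uint64 // hash of the label set
}

// labelFingerprint hashes a serialized label set.
func labelFingerprint(labels string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(labels))
	return h.Sum64()
}

// Prometheus' TSDB orders the series table lexicographically by label set;
// Loki's orders it by fingerprint instead.
func sortByFingerprint(ss []series) {
	sort.Slice(ss, func(i, j int) bool {
		return ss[i].fingerprint < ss[j].fingerprint
	})
}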
Chunks
Each series stores a list of chunks associated with it. We use an array of ChunkMetas:
// Meta holds information about a chunk of data.
type ChunkMeta struct {
	// Checksum of the chunk, used to reference it.
	Checksum uint32
	// Time bounds of the data in the chunk.
	MinTime, MaxTime int64
	// Bytes stored, rounded to nearest KB.
	KB uint32
	// Number of log entries in the chunk.
	Entries uint32
}
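These per-chunk statistics are also what power the index-stats response shown earlier. A rough sketch of the aggregation (illustrative, not Loki’s actual code) might look like:
// Stats mirrors the shape of the index-stats response shown earlier.
type Stats struct {
	Streams, Chunks, Entries, Bytes uint64
}

// aggregateStats walks the chunks of each matching series and sums their
// embedded statistics, without ever opening a chunk.
func aggregateStats(seriesChunks [][]ChunkMeta) Stats {
	var s Stats
	for _, chunks := range seriesChunks {
		s.Streams++
		for _, c := range chunks {
			s.Chunks++
			s.Entries += uint64(c.Entries)
			s.Bytes += uint64(c.KB) * 1024 // KB is rounded, so this is approximate
		}
	}
	return s
}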
Sharding
Loki’s TSDB is natively shardable. What does that mean? Let’s take a look:
First off, the numbers:
// factor 2 runs in 1/2 the time
Query_GetChunkRefsSharded/match_ns-2 33.9ms ± 1% 17.1ms ± 1% -49.66% (p=0.000 n=19+17)
// factor 4 runs in 1/4 the time
Query_GetChunkRefsSharded/match_ns-4 47.7ms ± 2% 11.8ms ± 3% -75.33% (p=0.000 n=20+19)
// factor 8 runs in 1/8 the time
Query_GetChunkRefsSharded/match_ns-8 72.3ms ± 2% 9.2ms ± 2% -87.34% (p=0.000 n=20+18)
// factor 16 runs in 1/16 the time
Query_GetChunkRefsSharded/match_ns-16 119ms ± 3% 7ms ± 1% -93.84% (p=0.000 n=18+20)
// factor 32 runs in 1/30 the time
Query_GetChunkRefsSharded/match_ns-32 212ms ± 1% 7ms ± 1% -96.64% (p=0.000 n=20+19)
30x? Really? Ok, ok, you’ve caught me. We’ve actually made requesting an individual shard of data in TSDB faster, in linear proportion to the shard factor chosen. Querying all shards should still be the same speed. This also means 30x isn’t the upper bound; it’s just the speedup at the highest shard factor I’ve tested (32).
Previously, to limit a TSDB query to a specific shard, we’d query the whole index and filter out all unneeded shards. For a factor of 16, this means we’re resolving 16x the data we need then throwing away 15 times as much data as we keep.
So, how does this work?
Membership
Historically, we used hash(labelset) % shard_factor to determine which shard a series belonged to. This is effective, but a sorted list of hashes does not correspond to a sorted list of shards. The following hashes yield the following shards (factor 16):
hash (base10) | hash (binary) | shard |
---|---|---|
0 | 0 | 0_of_16 |
5 | 101 | 5_of_16 |
16 | 10000 | 0_of_16 |
17 | 10001 | 1_of_16 |
This means that to find all the items in a given shard, we need to scan all the values and filter out the results that belong to other shards.
Instead of using modulos, let’s use bit prefixes. Given a factor of two, all hashes which start with a 0 bit belong to the first shard, and all hashes which start with a 1 bit belong to the second. This works for any power of two; it just requires checking more leading bits! Let’s look at a factor of 4 instead:
bit prefix | shard |
---|---|
00 | 0_of_4 |
01 | 1_of_4 |
10 | 2_of_4 |
11 | 3_of_4 |
Using this algorithm, a sorted list of hashes is a sorted list of shards for any shard factor!
hash (binary) | shard (factor 2) | shard (factor 4) |
---|---|---|
000 | 0_of_2 | 0_of_4 |
010 | 0_of_2 | 1_of_4 |
100 | 1_of_2 | 2_of_4 |
101 | 1_of_2 | 2_of_4 |
110 | 1_of_2 | 3_of_4 |
111 | 1_of_2 | 3_of_4 |
This allows Loki to use binary search to quickly find the relevant part of the index and skip the irrelevant parts:
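For the curious, here’s a minimal sketch of what that lookup can look like over a fingerprint-sorted series list; the function name and exact prefix arithmetic are illustrative, not Loki’s implementation:
import "sort"

// seriesForShard binary-searches a fingerprint-sorted series list and returns
// only the contiguous slice belonging to the requested shard (factor must be
// a power of two), skipping everything else.
func seriesForShard(fingerprints []uint64, shard, factor uint64) []uint64 {
	width := (^uint64(0))/factor + 1 // 2^64 / factor, the keyspace span of one shard
	min := shard * width
	lo := sort.Search(len(fingerprints), func(i int) bool { return fingerprints[i] >= min })
	hi := len(fingerprints)
	if shard != factor-1 { // the last shard runs to the end of the keyspace
		max := min + width
		hi = sort.Search(len(fingerprints), func(i int) bool { return fingerprints[i] >= max })
	}
	return fingerprints[lo:hi]
}
Under the old modulo scheme, a shard’s fingerprints are scattered across the entire list, so no such contiguous range exists and the whole index must be scanned.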
Tada! That’s it.
This does have a few caveats:
- All shard factors must be a power of 2
- Iterators over TSDB now return results in fingerprint order rather than lexicographic order
For those who want to follow along more closely, see the tracking issue.
Future Work
Index Queries
The chunk data embedded in TSDB paves the way for future improvements in index-only or index-accelerated queries. These are queries which don’t need to touch the underlying log data, but can be completed using the statistics in the index alone. This currently powers both query planning and the /series endpoint in Loki’s new index, but there are many potential applications, including:
- Feedback on query size before running
- Stream, Throughput, and Cardinality analysis
- Graceful downgrading of large queries to index-only approximations
These improvements allow us to run more reliably at higher cardinality and byte scale, with higher query throughput and better TCO.
Fin.