Pre Production Checklist

By Owen Diehl

May 22, 2022

I get asked this all the time, so here’s a list of things to take care of before running Loki in a production environment:

Collect its own logs

Make sure Loki is ingesting its own logs. This can be done just like you would for any other application (for example, via Promtail) and ensures that when something goes wrong, you’ll have logs available to debug. For an even more reliable setup, send them to a different Loki cluster, but this may not be feasible depending on your scale/investment.

Word of warning: do not use --log.level=debug for Promtail. This causes Promtail to log every line it ingests for debugging purposes, but when it is hooked up to a Loki cluster and also collecting its own logs, every ingested line produces another line to ingest, which can cause an infinite loop! This ends up being a kind of fork bomb.

Stop using the file system

Loki has a filesystem storage backend, but this is primarily for local development and proofs of concept. When you’re ready to run Loki in production, make sure you’re not storing data on the local file system, for a few reasons:

This breaks durability guarantees: if the disk is lost, so is all the data.
The file system does not scale as it’s only mounted to a single machine. This means Loki is limited by the size of the machine it runs on.
Other machines in a Loki cluster won’t be able to see this data, meaning we can’t parallelize or scale across the cluster.

Instead, use one of the many object store backends Loki supports, such as Google’s GCS, AWS’s S3, or Microsoft’s Azure Blob Storage. If you want/need to run on premises and don’t have access to a cloud provider’s object storage, take advantage of MinIO’s S3 compatibility: run a MinIO cluster yourself and tell Loki it’s S3.

In my experience, choosing a cloud provider’s object storage is much better than running your own.

Ensure your data is durable

Loki uses two main concepts to ensure that logs aren’t lost once they’re accepted:

Replication. Loki can optionally replicate all ingested data across multiple nodes. I say optional, but I consider this a requirement for production clusters. It protects against data loss in the case of node failure. The factor is configurable, and I suggest a replication_factor of 3. Replication also ensures that data is still queryable during an upgrade, when one node may be restarting.
A Write Ahead Log (WAL). This is a technique used by many databases for durability. Loki appends all accepted writes to a log file. Upon restart, Loki will “replay” this log file to ensure that data wasn’t lost.
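
To make the replication idea concrete, here is a toy model in Python. It is not how Loki places data internally; the node names, the replicas_for, write, and read helpers, and the hash-based placement are all assumptions made up for illustration. It only shows why three copies let reads keep working when a node is down or restarting.

```python
import zlib

REPLICATION_FACTOR = 3  # the value suggested above


def replicas_for(stream, nodes, rf=REPLICATION_FACTOR):
    """Pick rf consecutive nodes, starting at a position derived from the stream name.

    Hypothetical placement scheme; assumes rf <= len(nodes).
    """
    start = zlib.crc32(stream.encode("utf-8")) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


def write(stream, line, nodes, storage):
    """Store the line on every replica that owns this stream."""
    for node in replicas_for(stream, nodes):
        storage[node].append((stream, line))


def read(stream, nodes, storage, down=frozenset()):
    """Return the stream's lines from the first replica that is still up."""
    for node in replicas_for(stream, nodes):
        if node not in down:
            return [l for s, l in storage[node] if s == stream]
    raise RuntimeError("all replicas for this stream are down")


if __name__ == "__main__":
    nodes = ["ingester-0", "ingester-1", "ingester-2", "ingester-3"]
    storage = {n: [] for n in nodes}
    write("app", "hello", nodes, storage)
    restarting = {replicas_for("app", nodes)[0]}  # simulate one node restarting
    print(read("app", nodes, storage, down=restarting))
```

With a single copy, the same restart would make the stream unreadable until the node came back.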

You should turn both of these features on (replication_factor=3).
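
If the WAL idea is new to you, the following minimal Python sketch shows the append-then-replay pattern. It is an illustration only: the TinyWAL class, the JSON-lines file format, and the file path are assumptions invented for this example, and Loki’s real WAL is considerably more sophisticated.

```python
import json
import os


class TinyWAL:
    """Toy write-ahead log: every accepted write hits disk before memory."""

    def __init__(self, path):
        self.path = path
        self.entries = []  # in-memory state, which a crash would wipe out
        self._replay()     # rebuild that state from the log on startup
        self._log = open(path, "ab")

    def _replay(self):
        if not os.path.exists(self.path):
            return
        good_bytes = 0
        with open(self.path, "rb") as f:
            for raw in f:
                # A record without a trailing newline, or one that fails to
                # parse, is a torn write from a crash: stop replaying there.
                if not raw.endswith(b"\n"):
                    break
                try:
                    record = json.loads(raw)
                except ValueError:
                    break
                self.entries.append(record)
                good_bytes += len(raw)
        # Drop any torn tail so new appends start on a clean record boundary.
        os.truncate(self.path, good_bytes)

    def append(self, stream, line):
        record = {"stream": stream, "line": line}
        self._log.write(json.dumps(record).encode("utf-8") + b"\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # make sure the bytes reached the disk
        self.entries.append(record)   # only now is the write "accepted"

    def close(self):
        self._log.close()
```

Creating a second TinyWAL on the same path after a simulated crash restores every entry that was fully written, which is the same guarantee the replay step described above provides.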