Deploy a Monitoring Solution with Prometheus and Grafana On-Premises in HA Mode
Deploying Prometheus and Grafana in High Availability (HA) mode on-premises keeps monitoring running and data available when a node fails. In this setup, redundant Prometheus instances scrape the same targets, and redundant Grafana instances behind a load balancer query them.
Below are the steps to deploy Prometheus and Grafana on-premises with HA:
1. Plan the Architecture
- Prometheus Instances:
- Set up at least two Prometheus instances in HA mode for redundancy.
- Each Prometheus instance will scrape the same set of targets independently and store its own local data.
- Grafana Instances:
- Deploy at least two Grafana instances in HA mode, load balanced to ensure availability.
- Grafana queries Prometheus through a single endpoint (a load balancer or Thanos Querier); it does not merge or deduplicate data from separate Prometheus instances itself.
- Storage:
- For long-term storage, use Thanos or a remote-write backend such as VictoriaMetrics, Cortex, or Mimir.
- Use a shared SQL database (e.g., MySQL, PostgreSQL) so that all Grafana instances read and write the same dashboards, users, and configuration. The default embedded SQLite database is not suitable for sharing between instances.
2. Set Up Prometheus in HA Mode
Step 2.1: Install Prometheus
- Download and extract Prometheus on each node:
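wget https://github.com/prometheus/prometheus/releases/download/v2.37.0/prometheus-2.37.0.linux-amd64.tar.gz
tar -xvf prometheus-2.37.0.linux-amd64.tar.gz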
cd prometheus-2.37.0.linux-amd64
- Copy the Prometheus binary to /usr/local/bin and set up the configuration directory (/etc/prometheus).
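The copy step can be sketched as a small shell helper. This is only an illustration: install_prometheus is a hypothetical function name, the optional second argument is a staging prefix added here so the layout can be tried in a scratch directory, and the /var/lib/prometheus data directory is taken from the service file used later in this guide.

```shell
#!/usr/bin/env bash
# Sketch: lay out the Prometheus binaries and directories used in this guide.
# install_prometheus is a hypothetical helper, not part of Prometheus.

install_prometheus() {
  local src="$1"        # extracted release directory
  local root="${2:-}"   # optional staging prefix; empty means the real filesystem

  # Create the binary, configuration, and data directories.
  install -d "$root/usr/local/bin" "$root/etc/prometheus" "$root/var/lib/prometheus"

  # Copy the server and its companion CLI (promtool ships in the same tarball).
  install -m 0755 "$src/prometheus" "$src/promtool" "$root/usr/local/bin/"
}

# Typical use on a node, run as root:
#   useradd --system --no-create-home --shell /usr/sbin/nologin prometheus
#   install_prometheus ./prometheus-2.37.0.linux-amd64
#   chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
```

The commented commands create the prometheus user that the systemd unit runs as and hand it ownership of the configuration and data directories.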
Step 2.2: Configure Prometheus
- Create a prometheus.yml configuration file in /etc/prometheus for each instance:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'your_targets'
    static_configs:
      - targets: ['<target_ip1>:<port>', '<target_ip2>:<port>']
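- For HA, configure each Prometheus instance with the same scrape targets and rules.
- High Availability Labeling:
- To distinguish between HA Prometheus instances, give each one a unique external label (for example, replica: prometheus-1) under global.external_labels in its prometheus.yml. Prometheus has no peer or clustering flag of its own; --cluster.peer is an Alertmanager flag.
- The instances do not synchronize with each other. They run as independent replicas, and the label lets a query layer such as Thanos Querier deduplicate their series.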
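One common way to keep HA replicas distinguishable is a per-instance external label. The sketch below generates a prometheus.yml that differs between nodes only in that label. The label names cluster and replica, the value onprem, and the scratch output directory are assumptions chosen for illustration; the scrape job reuses the placeholders from the example configuration.

```shell
#!/usr/bin/env bash
# Sketch: generate a prometheus.yml whose only per-node difference is the
# "replica" external label. Label names and values are assumptions.
set -euo pipefail

OUT_DIR="${OUT_DIR:-./prometheus-ha-sketch}"   # scratch dir; copy to /etc/prometheus after review
REPLICA="${REPLICA:-prometheus-1}"             # use prometheus-2 on the second node

mkdir -p "$OUT_DIR"
cat > "$OUT_DIR/prometheus.yml" <<EOF
global:
  scrape_interval: 15s
  external_labels:
    cluster: onprem        # hypothetical cluster name, same on both replicas
    replica: ${REPLICA}    # the only value that differs per instance

scrape_configs:
  - job_name: 'your_targets'
    static_configs:
      - targets: ['<target_ip1>:<port>', '<target_ip2>:<port>']
EOF

echo "wrote $OUT_DIR/prometheus.yml for replica ${REPLICA}"
```

Once the target placeholders are filled in, promtool check config can validate the file. If Thanos Querier is used, its --query.replica-label flag would name the same label (replica here) so that duplicate series from the two instances are merged at query time.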
Step 2.3: Start Prometheus
- Create a systemd service file for each Prometheus instance at /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus --web.enable-lifecycle
[Install]
WantedBy=multi-user.target
- Enable and start each Prometheus instance:
sudo systemctl enable prometheus
sudo systemctl start prometheus
3. Install and Configure Thanos (Optional for Long-Term Storage)
- Deploy Thanos Sidecar alongside each Prometheus instance for storing data in a distributed store and enabling HA Prometheus queries.
- Thanos Sidecar:
- Set up a sidecar container or service to work with each Prometheus instance.
- It will upload data to an object storage (e.g., S3, MinIO) and enable querying of both Prometheus instances as a unified source.
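As a rough sketch of the sidecar setup, the script below writes an object-store configuration for an S3-compatible store such as MinIO and prints the sidecar command for one node without running it. The bucket name, endpoint, and credentials are placeholders, the output directory is a scratch location, and the ports are the Thanos defaults; treat all of them as assumptions to adapt.

```shell
#!/usr/bin/env bash
# Sketch: object-store config plus the Thanos sidecar invocation for one node.
set -euo pipefail

OUT_DIR="${OUT_DIR:-./thanos-sketch}"   # scratch dir; move the file somewhere permanent later
mkdir -p "$OUT_DIR"

# Object storage settings for an S3-compatible store such as MinIO.
# Bucket name, endpoint, and credentials are placeholders.
cat > "$OUT_DIR/bucket.yml" <<'EOF'
type: S3
config:
  bucket: "<bucket_name>"
  endpoint: "<minio_host>:9000"
  access_key: "<access_key>"
  secret_key: "<secret_key>"
  insecure: true   # plain HTTP; set to false when the store serves TLS
EOF

# Build the command as an array so it can be inspected before it is run.
SIDECAR_CMD=(
  thanos sidecar
  --tsdb.path=/var/lib/prometheus
  --prometheus.url=http://localhost:9090
  --objstore.config-file="$OUT_DIR/bucket.yml"
  --grpc-address=0.0.0.0:10901
  --http-address=0.0.0.0:10902
)

# Print only; run it (or wrap it in a systemd unit) once the placeholders are filled in.
printf '%s ' "${SIDECAR_CMD[@]}"; echo
```

For uploads to work as intended, the Thanos documentation recommends starting Prometheus with --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration set to the same value (2h) and giving each instance unique external labels.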
4. Deploy Grafana in HA Mode
Step 4.1: Install Grafana
- Download and install Grafana on each node:
wget https://dl.grafana.com/oss/release/grafana-8.0.0.linux-amd64.tar.gz
tar -zxvf grafana-8.0.0.linux-amd64.tar.gz
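- Copy the grafana-server binary to /usr/local/bin, keep the extracted grafana-8.0.0 directory (it holds the public and conf folders Grafana needs at runtime), and set up the configuration directory (/etc/grafana).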
Step 4.2: Configure Grafana
- In the Grafana configuration file (/etc/grafana/grafana.ini), set up the database to store Grafana data centrally:
[database]
type = postgres
host = <database_host>:5432
name = grafana
user = grafana_user
password = grafana_password
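- Add Prometheus as a data source in Grafana. Grafana does not fail over or load-balance between data sources on its own, so point the data source at the Prometheus load balancer or at Thanos Querier. The individual instances can be added as extra data sources for troubleshooting.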
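Data sources can also be declared in a provisioning file so that every Grafana node starts with the same definition. A minimal sketch follows, assuming a single query endpoint (a load balancer or Thanos Querier) written as a placeholder URL; the data source name and the scratch output directory are illustrative choices.

```shell
#!/usr/bin/env bash
# Sketch: write a Grafana data source provisioning file.
# The URL is a placeholder for whatever endpoint fronts the Prometheus replicas.
set -euo pipefail

OUT_DIR="${OUT_DIR:-./grafana-provisioning-sketch}"   # scratch dir for review
mkdir -p "$OUT_DIR/datasources"

cat > "$OUT_DIR/datasources/prometheus.yml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus          # illustrative name
    type: prometheus
    access: proxy             # Grafana's backend performs the queries
    url: http://<prometheus_endpoint>:<port>
    isDefault: true
EOF

echo "wrote $OUT_DIR/datasources/prometheus.yml"
```

After filling in the URL, the file would go into the datasources folder of Grafana's provisioning directory on each node (the location is set by the provisioning option in the [paths] section of grafana.ini), and Grafana reads it at startup.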
Step 4.3: Start Grafana
- Set up a systemd service for Grafana:
[Unit]
Description=Grafana
After=network.target
[Service]
User=grafana
ExecStart=/usr/local/bin/grafana-server --config /etc/grafana/grafana.ini --homepath <grafana_install_dir>
[Install]
WantedBy=multi-user.target
- Enable and start Grafana:
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
5. Set Up Load Balancers for HA
- Prometheus Load Balancer:
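- Set up a load balancer in front of the Prometheus instances so queries keep working when one instance fails. Because each replica scrapes at slightly different times, prefer sticky or active/passive routing over round-robin to keep graphs from shifting between refreshes.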
- Grafana Load Balancer:
- Set up another load balancer for the Grafana instances to distribute user access and enable failover.
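The guide does not prescribe a particular load balancer, so the following is only a sketch that assumes HAProxy. It writes a configuration with health-checked backends for Grafana and Prometheus. The IP placeholders, the frontend and backend names, and the choice to mark the second Prometheus server as a backup (active/passive) are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: HAProxy configuration fronting two Grafana and two Prometheus nodes.
# HAProxy itself is an assumption; any HTTP load balancer with health checks works.
set -euo pipefail

OUT_DIR="${OUT_DIR:-./haproxy-sketch}"   # scratch dir for review
mkdir -p "$OUT_DIR"

cat > "$OUT_DIR/haproxy.cfg" <<'EOF'
defaults
  mode http
  timeout connect 5s
  timeout client  60s
  timeout server  60s

frontend grafana_in
  bind *:3000
  default_backend grafana_nodes

backend grafana_nodes
  # Grafana exposes a health endpoint at /api/health
  option httpchk GET /api/health
  server grafana1 <grafana_ip1>:3000 check
  server grafana2 <grafana_ip2>:3000 check

frontend prometheus_in
  bind *:9090
  default_backend prometheus_nodes

backend prometheus_nodes
  # Prometheus reports readiness at /-/ready
  option httpchk GET /-/ready
  server prom1 <prometheus_ip1>:9090 check
  # "backup" keeps this replica idle until prom1 fails its health check
  server prom2 <prometheus_ip2>:9090 check backup
EOF

echo "wrote $OUT_DIR/haproxy.cfg"
```

Once the placeholders are replaced, haproxy -c -f haproxy.cfg checks the file for syntax errors. The load balancer is itself a single point of failure unless it is also made redundant, for example with a floating virtual IP.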
6. Verify and Test the HA Setup
- Prometheus:
- Test that both Prometheus instances are running independently by accessing them via <node_ip>:9090.
- Use Thanos Querier (if configured) to query both Prometheus instances as a single source.
- Grafana:
- Log in to Grafana via the load balancer IP, add Prometheus as a data source, and create a sample dashboard.
- Simulate a failure on one Grafana instance and ensure that the other instance handles the load transparently.
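These checks can be partly scripted. The sketch below probes the health endpoints that Prometheus (/-/healthy) and Grafana (/api/health) expose, using curl. The function names check_health and check_all and the name=url argument format are hypothetical conventions introduced for this example.

```shell
#!/usr/bin/env bash
# Sketch: probe health endpoints on each replica and report failures.
# check_health and check_all are hypothetical helper names.

check_health() {
  local name="$1" url="$2"
  # --fail makes curl exit non-zero on HTTP errors (4xx/5xx).
  if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
    echo "OK   $name $url"
  else
    echo "FAIL $name $url"
    return 1
  fi
}

check_all() {
  local failed=0 entry
  for entry in "$@"; do
    # Each entry has the form "name=url"; split on the first "=".
    check_health "${entry%%=*}" "${entry#*=}" || failed=1
  done
  return "$failed"
}

# Example call (addresses are placeholders):
#   check_all \
#     "prometheus-1=http://<node1_ip>:9090/-/healthy" \
#     "prometheus-2=http://<node2_ip>:9090/-/healthy" \
#     "grafana-lb=http://<load_balancer_ip>:3000/api/health"
```

Running the same call before and after stopping one Grafana instance shows whether the load balancer endpoint stays healthy while the individual node reports a failure.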
7. Enable Monitoring and Alerting
- Configure Alertmanager for Prometheus:
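- Set up Alertmanager to deduplicate, group, and route the alerts that Prometheus fires.
- For HA, deploy multiple Alertmanager instances as a cluster and configure every Prometheus instance to send alerts to all of them directly rather than through a load balancer.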
- Set up alerts in Grafana for visualization and notifications based on key metrics and alert rules.
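To illustrate Alertmanager clustering, the sketch below writes the alerting fragment that each Prometheus replica would carry and prints the command line for the first Alertmanager node without running it. The IP placeholders, the /etc/alertmanager path, and the replica label name in the relabel rule are assumptions; ports 9093 and 9094 are the Alertmanager defaults.

```shell
#!/usr/bin/env bash
# Sketch: Prometheus alerting fragment and a clustered Alertmanager command line.
set -euo pipefail

OUT_DIR="${OUT_DIR:-./alertmanager-ha-sketch}"   # scratch dir for review
mkdir -p "$OUT_DIR"

# Fragment to merge into prometheus.yml on BOTH Prometheus replicas.
cat > "$OUT_DIR/prometheus-alerting.yml" <<'EOF'
alerting:
  # Only needed if the replicas carry a per-instance external label
  # (assumed here to be named "replica"): dropping it makes both replicas
  # send identical alerts, which Alertmanager can then deduplicate.
  alert_relabel_configs:
    - regex: replica
      action: labeldrop
  alertmanagers:
    - static_configs:
        - targets: ['<alertmanager_ip1>:9093', '<alertmanager_ip2>:9093']
EOF

# Command line for the first Alertmanager; use the other node's address as
# the peer on the second instance.
ALERTMANAGER_CMD=(
  alertmanager
  --config.file=/etc/alertmanager/alertmanager.yml
  --cluster.listen-address=0.0.0.0:9094
  --cluster.peer='<alertmanager_ip2>:9094'
)

# Print only; run it once the placeholders are filled in.
printf '%s ' "${ALERTMANAGER_CMD[@]}"; echo
```

Listing every Alertmanager in each Prometheus configuration means an alert still reaches the cluster when one Alertmanager is down, and the cluster peers coordinate so that a notification is sent only once.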
Summary of Key Points
- HA Prometheus: Multiple Prometheus instances scraping the same targets, optionally with Thanos for long-term storage and aggregation.
- HA Grafana: Multiple Grafana instances with a centralized database for dashboards, load-balanced to ensure redundancy.
- Alerting: Use Alertmanager in HA mode to handle alerts from Prometheus.
This HA setup for Prometheus and Grafana provides a monitoring solution that keeps collecting metrics, serving dashboards, and sending alerts when any single Prometheus, Grafana, or Alertmanager instance fails.