Metadata Repository Failover Certification for PostgreSQL Replication and Automatic Failover


Xiaoyan Xu

Quality Architect • MicroStrategy


This document provides information to help customers assess and quantify the Metadata repository failover solution based on PostgreSQL replication and automatic failover.

Starting with the release of Strategy ONE (March 2024), dossiers are also known as dashboards.

Test Methodology

Our test scripts use Apache JMeter simulations to replicate the typical workflows of users accessing Strategy: for example, multiple users logging on, running dossiers, and performing various manipulations.
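A load test of this shape is typically driven from JMeter's non-GUI mode. The sketch below is illustrative only; the test-plan and result-file names, and the property names passed with `-J`, are placeholders, not the actual scripts used in these tests.

```shell
# Run a JMeter test plan in non-GUI mode (recommended for load testing).
# workflow_testplan.jmx and results.jtl are hypothetical file names.
# -J passes user-defined properties the test plan can read, e.g. thread
# count and ramp-up time.
jmeter -n -t workflow_testplan.jmx -l results.jtl \
       -Jusers=50 -Jrampup=60
```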

 

Test Application

The Universal Benchmark project is the standard benchmark application we used for our tests. Its dossier has multiple chapters and pages covering different types of visualizations and data aggregation.

  • Query Metadata Workflows: retrieving the user library and executing dossiers, which invoke Metadata Read operations only.
  • Update Metadata Workflows: Library dossier execution and manipulation, which invoke common Metadata Read and Write operations.
  • Login Metadata Workflows: pure logins to Library with different user accounts, which invoke light, quick Metadata Read operations only.
     

Test Infrastructure

A network proxy (the Linux traffic control utility iproute-tc) was verified to inject latency and throttle throughput accurately, simulating a geographically distributed deployment. The proxy is set up on the secondary PostgreSQL server to mimic the latency between the West Region and the East Region.
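With iproute-tc, latency injection of this kind can be sketched as follows; the interface name eth0 and the 50 ms value are assumptions for illustration:

```shell
# Add 50 ms of one-way delay to all traffic leaving eth0 (requires root).
tc qdisc add dev eth0 root netem delay 50ms

# Confirm the netem queueing discipline is active.
tc qdisc show dev eth0

# Remove the injected delay after the test run.
tc qdisc del dev eth0 root netem
```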
 

Test Environment

Strategy Intelligence Server and Library:

  • AWS R4 Instance r4.4xlarge
  • 16vCPU, 122GiB Memory

PostgreSQL Server:

  • PostgreSQL 14
  • Standard authentication
  • RHEL 7.9 Linux Server

PostgreSQL high availability solution:

  • Third-party tool pg_auto_failover to provide PostgreSQL replication and automatic failover
  • Synchronous mode for Primary PostgreSQL server and Secondary PostgreSQL server
  • PostgreSQL native ODBC driver
  • Refer to this KB article for the detailed setup.
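At a high level, a pg_auto_failover deployment of this shape consists of one monitor node plus two PostgreSQL nodes. The commands below are a minimal sketch of that topology; the hostnames and data-directory paths are placeholders, and the KB article referenced above remains the authoritative setup guide.

```shell
# On the monitor node (hostname is a placeholder):
pg_autoctl create monitor --pgdata /var/lib/pgsql/monitor \
    --hostname monitor.example.com

# On each PostgreSQL node (run on the primary first, then the secondary;
# the second node is automatically initialized as a standby):
pg_autoctl create postgres --pgdata /var/lib/pgsql/14/data \
    --hostname pg1.example.com \
    --monitor 'postgres://autoctl_node@monitor.example.com:5432/pg_auto_failover'

# Keep the keeper service running so automatic failover can happen.
pg_autoctl run --pgdata /var/lib/pgsql/14/data
```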

 

Assumptions


 

  • Our test results were obtained with the specific test application and test environment described above. We draw conclusions from the regular patterns found in those results.
  • Our tests run in a single-machine environment that hosts both the web application server and the Intelligence Server.

 

Conclusions

 

1. What is the impact of higher latency on Query Metadata Workflows on a single-node I-Server?

No functional or performance impact.
 

2. What is the impact of higher latency on Update Metadata Workflows on a single-node I-Server?

  • While higher latency does impact the performance of Update Metadata Workflows, the observed degradations are still within acceptable limits and are unlikely to significantly affect the overall functionality and usability.
  • No functional impact.
  • When comparing 50ms latency to 1ms latency between two high availability Metadata Repository Servers:
    1. The average response time degraded by 5.2%, an absolute increase of 0.15 seconds (from 2.88s to 3.03s).
    2. The throughput, measured in requests per minute, showed a degradation of 5.6%.
  • When comparing 100ms latency to 1ms latency between two high availability Metadata Repository Servers:
    1. The average response time degraded by 7.29%, an absolute increase of 0.21 seconds (from 2.88s to 3.09s).
    2. The throughput, measured in requests per minute, showed a degradation of 12.56%.
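As a sanity check, the percentage degradations above follow directly from the absolute response times against the 2.88-second baseline:

```shell
# Response-time degradation relative to the 1 ms-latency baseline (2.88 s).
# 50 ms latency: 3.03 s observed.
awk 'BEGIN { printf "50ms:  %.2f%%\n", (3.03 - 2.88) / 2.88 * 100 }'   # → 5.21%
# 100 ms latency: 2.88 s + 0.21 s = 3.09 s observed.
awk 'BEGIN { printf "100ms: %.2f%%\n", (3.09 - 2.88) / 2.88 * 100 }'   # → 7.29%
```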

3. What is the failover performance with Login Metadata Workflows on a single-node I-Server?


 

  • It takes about 12 seconds for the PostgreSQL monitor node to detect the unhealthy node, and another 33 seconds to promote the secondary PostgreSQL server to primary.
  • During the PostgreSQL failover process (~45 seconds), all requests fail.
  • After the PostgreSQL failover is finished successfully, the end user experience will depend on the Metadata connection timeout setting.
    • With the default 300-second timeout setting, the end user first waits 120 seconds for the request to time out, then logs in successfully.
    • With the timeout changed to 10 seconds, the end user waits about 10 seconds to log in successfully.
    • Refer to KB Draft - How to improve performance after Primary PostgreSQL Metadata Repository is down during Failover for solution 1.
    • A user experience video is attached for your reference.
  • If all three solutions mentioned in KB Draft - How to improve performance after Primary PostgreSQL Metadata Repository is down during Failover are applied, then after the PostgreSQL failover finishes successfully, the end user will not notice any difference from before the failover; the login request succeeds within 1 second.
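During such a test, the failover state machine can be observed from the monitor, and a controlled failover can be triggered manually with pg_autoctl. The data-directory path below is a placeholder:

```shell
# Show the current formation state of all nodes
# (primary / secondary / catchingup, etc.).
pg_autoctl show state --pgdata /var/lib/pgsql/monitor

# Trigger a controlled failover for testing purposes.
pg_autoctl perform failover --pgdata /var/lib/pgsql/monitor
```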

4. What is the failover performance with Login Metadata Workflows on a 2-node cluster of I-Servers?

  • The failover performance on the PostgreSQL server side is similar to that in Conclusion 3.
  • After the PostgreSQL failover finishes successfully, the end user may experience 2 to 16 minutes of recovery time, depending on the number of cached Metadata connections.
  • Known limitations and workaround: KB Draft - How to improve performance after Primary PostgreSQL Metadata Repository is down during Failover
  • If all three solutions mentioned in the above KB are applied, then after the PostgreSQL failover finishes successfully, the end user will not notice any difference from before the failover; the login request succeeds within 1 second.

Check out the attached user experience video for your reference.
 

5. What is the failover performance with Update Metadata Workflows on a single-node I-Server?


 

  • The failover performance on the PostgreSQL server side is similar to that in Conclusion 3.
  • After the PostgreSQL failover finishes successfully, the end user may experience 2 to 40 minutes of recovery time, depending on the manipulation types.
  • This is a known limitation; refer to the following KB to apply the workaround: KB - Library will keep polling I-Server for 10 minutes until timeout if DB server is down when performing document manipulations

    Check out the attached user experience video for your reference.

6. What happens when the primary PostgreSQL server comes back up after a failure?

 

6.1. What does it take for the pre-primary Metadata Repository Server to serve as primary again?
The current pg_auto_failover solution requires manual intervention to make the node primary again.

 

6.2. How long does it take for the pre-primary Metadata Repository Server to catch up on missing transactions?
When comparing 50ms latency to 1ms latency between two high availability Metadata Repository Servers, with a total of 5.4GB of files to copy, the recovery time for the pre-primary Metadata Repository Server increases from 31 seconds to 133 seconds.
The recovery time is influenced by two primary factors:
 

  1. The total file size to copy, which mainly includes two items:
    1. The size of the whole pg_wal log folder.
    2. The size of the incremental base folder (which stores the database tables).
  2. The copy speed between the two machines: network latency, IO throughput, etc.
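The data volume a rejoining node must copy can be estimated on the primary by sizing those two directories. A minimal sketch, assuming a conventional data-directory layout (the default PGDATA path below is an assumption; adjust it to your installation):

```shell
# Rough estimate of the data a rejoining node must copy:
# the whole pg_wal log folder plus the base folder (database tables).
# The default path is an assumed PostgreSQL 14 location.
PGDATA="${PGDATA:-/var/lib/pgsql/14/data}"
du -sh "$PGDATA/pg_wal" "$PGDATA/base"
```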

 



Details

Knowledge Article

Published: June 26, 2023
Last Updated: March 21, 2024