[DPE-4114] Test: Scale to zero units #347
BalabaDmitri wants to merge 27 commits into canonical:main from
Conversation
Codecov Report: all modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
##             main     #347      +/-   ##
==========================================
- Coverage   80.31%   79.94%   -0.37%
==========================================
  Files          10       10
  Lines        2301     2169     -132
  Branches      376      344      -32
==========================================
- Hits         1848     1734     -114
+ Misses        369      368       -1
+ Partials       84       67      -17

View the full report in Codecov by Sentry.
@BalabaDmitri Thank you for the contribution. We need to add a ticket number and a PR body that better describes the What? Where? Why?
Re: PR. It is overall a good start!!! We need to cover all scaling cases:
re-scaling back with and without a drive. E.g. the default hostpath storage will remove the drive, so re-scaling will start on an empty disk. At the same time, btrfs will keep the disk available, so re-scaling will happen with a disk (it could be the default disk or a manually provided one). It could also be wrongly provided...
I mean we have these cases:
- user wants to restore the same storage: remove-unit postgresql/3 && add-unit -n 2 should reuse the old storage as the user didn't specify it (it can be a new disk, e.g. hostpath). => should be OK: a long SST/clone process for the 2nd+ node, but 0->1 is a new DB init as there is no disk.
- user wants to provide a specific, correct storage => should be OK automatically, just fast resilvering.
- user made a mistake and specified the wrong storage (another cluster) => charm blocked. Do NOT corrupt foreign disks!!!
- user made a mistake and specified the wrong storage (empty/corrupted/no psql data on it) => block to be on the safe side; force the user to remove the disk and re-scale without any disks.
- ...
I believe we should test all those cases above (a sketch of the first case follows at the end of this comment).
P.S. The charm should be safe and block bootstrap if it finds a foreign cluster on the disk, etc. (a probable improvement is required here).
P.P.S. On top of this: it can be a disk from a different charm revision OR a different PostgreSQL version. We need to test that too. Block or accept is a question for @7annaba3l :-D
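To make the first case concrete, here is a minimal sketch (not the PR's actual test), assuming the add_unit_with_storage and storage_id helpers from the HA test helpers and the usual ops_test fixture:

# Hypothetical test for the "restore the same storage" case: scale to zero, then
# re-add units attached to the old storage. Helper names are assumptions.
async def test_rescale_reusing_storage(ops_test):
    app = APP_NAME

    # Remember which storage each unit had before scaling down.
    storages = [
        await storage_id(ops_test, unit.name)
        for unit in ops_test.model.applications[app].units
    ]

    # Scale to zero units.
    for unit in list(ops_test.model.applications[app].units):
        await ops_test.model.destroy_unit(unit.name)
    await ops_test.model.block_until(
        lambda: len(ops_test.model.applications[app].units) == 0, timeout=1000
    )

    # Scale back up on the old storage: 0 -> 1 re-initialises from the reattached disk,
    # the remaining units resync from the first one.
    for storage in storages:
        await add_unit_with_storage(ops_test, app=app, storage=storage)
    await ops_test.model.wait_for_idle(apps=[app], status="active", timeout=3000)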
# Scale the database to three units.
for store_id in storage_id_list:
    await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
JFMI, should we use the lib directly? The helper refers to this issue, which has since been resolved:
Note: this function exists as a temporary solution until this issue is resolved:
https://github.com/juju/python-libjuju/issues/695
Yes. We should remove the workaround and use the methods provided by the lib.
IIRC this is only available in libjuju 3
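If the workaround is dropped, the call could look roughly like this. This is only a sketch: it assumes libjuju 3's Application.add_unit accepts an attach_storage argument, which should be verified against the libjuju 3 API.

# Hypothetical direct libjuju usage instead of the add_unit_with_storage helper.
application = ops_test.model.applications[APP_NAME]
for store_id in storage_id_list:
    # attach_storage support in add_unit is an assumption (libjuju 3).
    await application.add_unit(count=1, attach_storage=[store_id])
await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active", timeout=3000)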
BTW, @BalabaDmitri, you need to sign the CLA and fix the lint tests here. Please check https://github.com/canonical/postgresql-operator/blob/main/CONTRIBUTING.md
    channel="edge",
)
# Deploy the continuous writes application charm if it wasn't already deployed.
This part can be removed, as the continuous writes application is already deployed by test_build_and_deploy.
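For reference, the guard being discussed boils down to something like the following (the charm/application name is illustrative only):

# Sketch of the idempotent-deploy guard; "postgresql-test-app" is an assumed name.
if "postgresql-test-app" not in ops_test.model.applications:
    await ops_test.model.deploy("postgresql-test-app", channel="edge")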
)
if wait_for_apps:
    await ops_test.model.wait_for_idle(status="active", timeout=3000)
After the above comment is handled, this line can be moved close to the deployment of the PostgreSQL application.
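A rough sketch of that suggestion, with the wait placed right after the PostgreSQL deployment and scoped to it (build and deploy parameters are illustrative):

# Deploy PostgreSQL and wait for it immediately, instead of waiting at the end of setup.
charm = await ops_test.build_charm(".")
await ops_test.model.deploy(charm, application_name=APP_NAME, num_units=3)
if wait_for_apps:
    await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active", timeout=3000)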
# Scale the database to three units.
for store_id in storage_id_list:
    await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
Yes. We should remove the workaround and use the methods provided by the lib.
# Scale the database to one unit.
logger.info("scaling database to one unit")
await add_unit_with_storage(ops_test, storage=primary_storage, app=APP_NAME)
After the unit starts, we should check whether the data on the storage has actually been restored.
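One way to do that check, as a sketch: connect to the re-added unit and confirm the rows written before the scale-down are still there. build_connection_string is a placeholder for whatever helper the suite provides, and the continuous_writes table name is an assumption.

import psycopg2

# Hypothetical check that the data survived on the reattached storage.
unit = ops_test.model.applications[APP_NAME].units[0]
connection_string = await build_connection_string(ops_test, unit.name)  # hypothetical helper
with psycopg2.connect(connection_string) as connection, connection.cursor() as cursor:
    # "continuous_writes" is the table assumed to be populated by the writes application.
    cursor.execute("SELECT COUNT(*) FROM continuous_writes;")
    count = cursor.fetchone()[0]
assert count > 0, "expected pre-existing writes to survive the storage reattachment"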
# Scale the database to three units.
for store_id in storage_id_list:
    await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
await check_writes(ops_test)
After the 2nd and 3rd units start, we need to check that the data on them is restored from WAL (not via backup/restore).
Maybe @dragomirp or @marceloneppel know how to check this.
ha_test.helpers.reused_replica_storage() and ha_test.helpers.reused_full_cluster_recovery_storage() should do the trick.
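A sketch of how those helpers might be applied after scaling back up; the import path and the (ops_test, unit_name) signature are assumptions, and first_unit_name stands for the unit that was re-added first:

from tests.integration.ha_tests.helpers import (  # path is an assumption
    reused_full_cluster_recovery_storage,
    reused_replica_storage,
)

# The first unit re-added on old storage should have recovered the whole cluster from
# its reattached disk; the 2nd and 3rd units should have reused their storage as
# replicas (WAL replay), not been rebuilt via backup/restore.
assert await reused_full_cluster_recovery_storage(ops_test, first_unit_name)
for unit in ops_test.model.applications[APP_NAME].units:
    if unit.name != first_unit_name:
        assert await reused_replica_storage(ops_test, unit.name)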
Force-pushed f9113e2 to a18b1d3
Discussed this today on the sync. For the history:
Some examples in my mind:
Force-pushed 3b785a2 to a63da74
Force-pushed a63da74 to d917d88
Force-pushed 4738de9 to 0a0486f
@marceloneppel Can you run the actions?
async def get_db_connection(ops_test, dbname, is_primary=True, replica_unit_name=""):
    unit_name = await get_primary(ops_test, APP_NAME)
You may add the type hint for the returned values and a docstring to make the output even easier to understand. The same also applies to the other functions you created in this file.
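For example, something along these lines; the return type is a guess based on how the helper seems to be used and should be adjusted to match the real one:

from pytest_operator.plugin import OpsTest

async def get_db_connection(
    ops_test: OpsTest,
    dbname: str,
    is_primary: bool = True,
    replica_unit_name: str = "",
) -> tuple[str, str]:
    """Build a connection string for the given database.

    Args:
        ops_test: the pytest-operator fixture wrapping the current Juju model.
        dbname: name of the database to connect to.
        is_primary: connect through the primary unit when True.
        replica_unit_name: unit to target instead of the primary when is_primary is False.

    Returns:
        A tuple of (connection string, name of the unit the connection points to).
    """
    unit_name = await get_primary(ops_test, APP_NAME)
    ...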
Force-pushed a39b446 to 6873326
taurus-forever left a comment:
LGTM, this is something we were missing. Tnx!
logger.info("database scaling up to two units using third-party cluster storage")
new_unit = await add_unit_with_storage(
    ops_test, app=app, storage=second_storage, is_blocked=True
)
nit: IMHO, it's worth checking that we are blocked with the right message (foreign disk).
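Something like the following could cover the nit, assuming the returned unit exposes libjuju's workload_status fields and that the blocked message mentions the foreign disk:

# Assert the unit is blocked for the expected reason, not just blocked.
assert new_unit.workload_status == "blocked"
# The exact message text is an assumption; match whatever the charm actually sets.
assert "foreign" in (new_unit.workload_status_message or "").lower(), (
    f"unexpected blocked message: {new_unit.workload_status_message}"
)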
@dragomirp Can you check?
Hi, @BalabaDmitri, can you resync with main again to retrigger the CI and to pick up some fixes? Sorry for asking again.
Hi, @BalabaDmitri, the new test seems to frequently fail (e.g. here). I've also seen it fail the assertion that writes continue (e.g. here). Please take another look.
Issue #445
Solution
Test coverage of the following cases:
Tests do not cover the following cases:
Implementation