fairywren hung in pg_basebackup tests

andrew@dunslane.net2022-07-24T16:27:07+00:00
fairywren (msys2 animal) is currently hung in the pg_basebackup tests. Here's the bottom of the regress log. I don't have further info as yet, but can dig is someone has a suggestion. ### Starting node "main" # Running: pg_ctl -w -D C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_main_data/pgdata -l C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/log/010_pg_basebackup_main.log -o --cluster-name=main start waiting for server to start.... done server started # Postmaster PID for node "main" is 5368 Junction created for C:\tools\nmsys64\tmp\pIBOsSp9se\tempdir <<===>> C:\tools\nmsys64\home\pgrunner\bf\root\REL_15_STABLE\pgsql.build\src\bin\pg_basebackup\tmp_check\tmp_test_jVCb # Taking pg_basebackup tarbackup2 from node "main" # Running: pg_basebackup -D C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_main_data/backup/tarbackup2 -h 127.0.0.1 -p 52897 --checkpoint fast --no-sync -Ft # Backup finished [13:11:33.592](0.978s) ok 95 - backup tar was created [13:11:33.592](0.000s) ok 96 - WAL tar was created [13:11:33.593](0.000s) not ok 97 - one tablespace tar was created [13:11:33.593](0.000s) [13:11:33.593](0.000s) #   Failed test 'one tablespace tar was created' #   at t/010_pg_basebackup.pl line 352. [13:11:33.593](0.000s) #          got: '0' #     expected: '1' # Checking port 52898 # Found port 52898 Name: replica Data directory: C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_replica_data/pgdata Backup directory: C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_replica_data/backup Archive directory: C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_replica_data/archives Connection string: port=52898 host=127.0.0.1 Log file: C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/log/010_pg_basebackup_replica.log # Initializing node "replica" from backup "tarbackup2" of node "main" # Running: C:/Windows/System32/tar xf C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_main_data/backup/tarbackup2/base.tar -C C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_replica_data/pgdata # Running: C:/Windows/System32/tar xf C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_main_data/backup/tarbackup2/pg_wal.tar -C C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_replica_data/pgdata/pg_wal Use of uninitialized value $_[2] in join or string at C:\tools\nmsys64\home\pgrunner\bf\root\REL_15_STABLE\pgsql\src\test\perl/PostgreSQL/Test/Utils.pm line 337. # Running: C:/Windows/System32/tar xf -C C:\tools\nmsys64\home\pgrunner\bf\root\REL_15_STABLE\pgsql.build\src\bin\pg_basebackup\tmp_check\tmp_test_jVCb/tblspc1replica Use of uninitialized value$_[2] in system at C:\tools\nmsys64\home\pgrunner\bf\root\REL_15_STABLE\pgsql\src\test\perl/PostgreSQL/Test/Utils.pm line 338. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
tgl@sss.pgh.pa.us2022-07-24T16:55:56+00:00
Andrew Dunstan <andrew@dunslane.net> writes: > fairywren (msys2 animal) is currently hung in the pg_basebackup tests. > Here's the bottom of the regress log. I don't have further info as yet, > but can dig is someone has a suggestion. Hm, what's with the "Use of uninitialized value" warnings? regards, tom lane
On Sun, Jul 24, 2022 at 12:55:56PM -0400, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > fairywren (msys2 animal) is currently hung in the pg_basebackup tests. > > Here's the bottom of the regress log. I don't have further info as yet, > > but can dig is someone has a suggestion. > > Hm, what's with the "Use of uninitialized value" warnings? The warnings are sequelae of: > > [13:11:33.593](0.000s) not ok 97 - one tablespace tar was created From that, it follows that $tblspc_tars[0] is undef at: PostgreSQL::Test::Utils::system_or_bail($tar, 'xf', $tblspc_tars[0], '-C',$repTsDir); > > # Running: C:/Windows/System32/tar xf  -C > > C:\tools\nmsys64\home\pgrunner\bf\root\REL_15_STABLE\pgsql.build\src\bin\pg_basebackup\tmp_check\tmp_test_jVCb/tblspc1replica I can confirm that Windows tar hangs when invoked that way. For preventing the hang, the test file could die() or skip the tar-program-using section after failing 'one tablespace tar was created'. That still leaves a question about why pg_basebackup didn't make the tablespace tar file. I would check 010_pg_basebackup_main.log for clues about that.
andrew@dunslane.net2022-07-25T13:44:21+00:00
On 2022-07-24 Su 15:10, Noah Misch wrote: > On Sun, Jul 24, 2022 at 12:55:56PM -0400, Tom Lane wrote: >> Andrew Dunstan <andrew@dunslane.net> writes: >>> fairywren (msys2 animal) is currently hung in the pg_basebackup tests. >>> Here's the bottom of the regress log. I don't have further info as yet, >>> but can dig is someone has a suggestion. >> Hm, what's with the "Use of uninitialized value" warnings? > The warnings are sequelae of: > >>> [13:11:33.593](0.000s) not ok 97 - one tablespace tar was created > >From that, it follows that $tblspc_tars[0] is undef at: > > PostgreSQL::Test::Utils::system_or_bail($tar, 'xf', $tblspc_tars[0], > '-C',$repTsDir); > >>> # Running: C:/Windows/System32/tar xf  -C >>> C:\tools\nmsys64\home\pgrunner\bf\root\REL_15_STABLE\pgsql.build\src\bin\pg_basebackup\tmp_check\tmp_test_jVCb/tblspc1replica Perhaps we should have a guard in system_or_bail() and/or system_log() which bails if some element of @_ is undefined. > I can confirm that Windows tar hangs when invoked that way. For preventing > the hang, the test file could die() or skip the tar-program-using section > after failing 'one tablespace tar was created'. That still leaves a question > about why pg_basebackup didn't make the tablespace tar file. I would check > 010_pg_basebackup_main.log for clues about that. The same thing has happened again on HEAD. I don't see anything in that file that gives any clue. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Mon, Jul 25, 2022 at 09:44:21AM -0400, Andrew Dunstan wrote: > On 2022-07-24 Su 15:10, Noah Misch wrote: > > On Sun, Jul 24, 2022 at 12:55:56PM -0400, Tom Lane wrote: > >> Andrew Dunstan <andrew@dunslane.net> writes: > >>> fairywren (msys2 animal) is currently hung in the pg_basebackup tests. > >>> Here's the bottom of the regress log. I don't have further info as yet, > >>> but can dig is someone has a suggestion. > >> Hm, what's with the "Use of uninitialized value" warnings? > > The warnings are sequelae of: > > > >>> [13:11:33.593](0.000s) not ok 97 - one tablespace tar was created > > >From that, it follows that $tblspc_tars[0] is undef at: > > > > PostgreSQL::Test::Utils::system_or_bail($tar, 'xf', $tblspc_tars[0], > > '-C',$repTsDir); > > > >>> # Running: C:/Windows/System32/tar xf  -C > >>> C:\tools\nmsys64\home\pgrunner\bf\root\REL_15_STABLE\pgsql.build\src\bin\pg_basebackup\tmp_check\tmp_test_jVCb/tblspc1replica > > Perhaps we should have a guard in system_or_bail() and/or system_log() > which bails if some element of @_ is undefined. That would be reasonable. Also reasonable to impose some long timeout, maybe 10x or 100x PG_TEST_TIMEOUT_DEFAULT, on calls to those functions.
tgl@sss.pgh.pa.us2022-07-25T15:35:12+00:00
Noah Misch <noah@leadboat.com> writes: > On Mon, Jul 25, 2022 at 09:44:21AM -0400, Andrew Dunstan wrote: >> Perhaps we should have a guard in system_or_bail() and/or system_log() >> which bails if some element of @_ is undefined. +1, seeing how hard this is to diagnose. > That would be reasonable. Also reasonable to impose some long timeout, maybe > 10x or 100x PG_TEST_TIMEOUT_DEFAULT, on calls to those functions. Why would it need to be more than PG_TEST_TIMEOUT_DEFAULT? regards, tom lane
On Mon, Jul 25, 2022 at 11:35:12AM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > On Mon, Jul 25, 2022 at 09:44:21AM -0400, Andrew Dunstan wrote: > >> Perhaps we should have a guard in system_or_bail() and/or system_log() > >> which bails if some element of @_ is undefined. > > +1, seeing how hard this is to diagnose. > > > That would be reasonable. Also reasonable to impose some long timeout, maybe > > 10x or 100x PG_TEST_TIMEOUT_DEFAULT, on calls to those functions. > > Why would it need to be more than PG_TEST_TIMEOUT_DEFAULT? We run some long commands, like the parallel_schedule runs. Those currently use plain system(), but they probably should have used system_log() from a logging standpoint. If they had, PG_TEST_TIMEOUT_DEFAULT would have been too short. One could argue that anything that slow should declare its intent to be that slow, but that argument is getting into the territory of a policy change rather than a backstop for clearly-unintended longevity.
tgl@sss.pgh.pa.us2022-07-25T14:52:15+00:00
Andrew Dunstan <andrew@dunslane.net> writes: > On 2022-07-24 Su 15:10, Noah Misch wrote: >> On Sun, Jul 24, 2022 at 12:55:56PM -0400, Tom Lane wrote: >>> Andrew Dunstan <andrew@dunslane.net> writes: >>>> fairywren (msys2 animal) is currently hung in the pg_basebackup tests. >>>> Here's the bottom of the regress log. I don't have further info as yet, >>>> but can dig is someone has a suggestion. >>> Hm, what's with the "Use of uninitialized value" warnings? >> The warnings are sequelae of: >>> [13:11:33.593](0.000s) not ok 97 - one tablespace tar was created >> From that, it follows that $tblspc_tars[0] is undef at: >> PostgreSQL::Test::Utils::system_or_bail($tar, 'xf', $tblspc_tars[0], >> '-C',$repTsDir); Right, so the "glob" failed to find anything. Seeing that this test is new as of 534472375, which postdates fairywren's last successful run, I'd guess that the "glob" needs adjustment for msys path names. regards, tom lane
tgl@sss.pgh.pa.us2022-07-25T15:08:55+00:00
I wrote: > Right, so the "glob" failed to find anything. Seeing that this test > is new as of 534472375, which postdates fairywren's last successful > run, I'd guess that the "glob" needs adjustment for msys path names. Hmm ... an alternative theory is that the test is fine, and what it's telling us is that get_dirent_type() is still wrong on msys. Would that end in this symptom? regards, tom lane
thomas.munro@gmail.com2022-07-25T15:24:13+00:00
On Tue, Jul 26, 2022 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: > > Right, so the "glob" failed to find anything. Seeing that this test > > is new as of 534472375, which postdates fairywren's last successful > > run, I'd guess that the "glob" needs adjustment for msys path names. The test added by 534472375 is at the end, hundreds of lines later than the one that appears to be failing. > Hmm ... an alternative theory is that the test is fine, and what > it's telling us is that get_dirent_type() is still wrong on msys. > Would that end in this symptom? Hmm, possibly yes (if it sees a non-symlink, it'll skip it). If someone can run the test on an msys system, perhaps they could put a debugging elog() into the code modified by 9d3444dc to log d_name and the d_type that is returned? I'm struggling to understand why msys would change the answer though.
andrew@dunslane.net2022-07-25T16:02:57+00:00
On 2022-07-25 Mo 11:24, Thomas Munro wrote: > On Tue, Jul 26, 2022 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I wrote: >>> Right, so the "glob" failed to find anything. Seeing that this test >>> is new as of 534472375, which postdates fairywren's last successful >>> run, I'd guess that the "glob" needs adjustment for msys path names. > The test added by 534472375 is at the end, hundreds of lines later > than the one that appears to be failing. Right. > >> Hmm ... an alternative theory is that the test is fine, and what >> it's telling us is that get_dirent_type() is still wrong on msys. >> Would that end in this symptom? > Hmm, possibly yes (if it sees a non-symlink, it'll skip it). If > someone can run the test on an msys system, perhaps they could put a > debugging elog() into the code modified by 9d3444dc to log d_name and > the d_type that is returned? I'm struggling to understand why msys > would change the answer though. I have no idea either. The link exists and it is a junction. I'll see about logging details. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
thomas.munro@gmail.com2022-07-26T22:31:18+00:00
On Tue, Jul 26, 2022 at 4:03 AM Andrew Dunstan <andrew@dunslane.net> wrote: > On 2022-07-25 Mo 11:24, Thomas Munro wrote: > > On Tue, Jul 26, 2022 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Hmm ... an alternative theory is that the test is fine, and what > >> it's telling us is that get_dirent_type() is still wrong on msys. > >> Would that end in this symptom? > > Hmm, possibly yes (if it sees a non-symlink, it'll skip it). If > > someone can run the test on an msys system, perhaps they could put a > > debugging elog() into the code modified by 9d3444dc to log d_name and > > the d_type that is returned? I'm struggling to understand why msys > > would change the answer though. > > I have no idea either. The link exists and it is a junction. I'll see > about logging details. From the clues so far, it seems like pgwin32_is_junction(fullpath) was returning true, but the following code in get_dirent_type(), which was supposed to be equivalent, is not reached on MSYS (though it apparently does on MSVC?): + if ((fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0 && + (fd.dwReserved0 == IO_REPARSE_TAG_MOUNT_POINT)) d->ret.d_type = DT_LNK; pgwin32_is_junction() uses GetFileAttributes() and tests (attr & FILE_ATTRIBUTE_REPARSE_POINT) == FILE_ATTRIBUTE_REPARSE_POINT, which is like the first condition but lacks the dwReserved0 part. What is that part doing, and why would it be doing something different in MSVC and MSYS builds? That code came from 87e6ed7c, recently I was just trying to fix it by reordering the checks; oh, there was some discussion about that field[1]. One idea is that something about dwReserved0 or IO_REPARSE_TAG_MOUNT_POINT is different in the open source replacement system headers supplied by the MinGW project used by MSYS builds (right?), compared to the "real" Windows SDK's headers used by MSVC builds. Or perhaps there is some other dumb mistake, or perhaps the reparse point really is different, or ... I dunno, I'd probably shove a load of log messages in there and see what's going on. [1] https://www.postgresql.org/message-id/flat/CABUevEzURN%3DwC95JHvTKFJtEy0eY9rWO42yU%3D59-q8xSwm-Dug%40mail.gmail.com#ac54acd782fc849c0fe6c2c05db101dc
andrew@dunslane.net2022-07-27T13:47:38+00:00
On 2022-07-26 Tu 18:31, Thomas Munro wrote: > On Tue, Jul 26, 2022 at 4:03 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> On 2022-07-25 Mo 11:24, Thomas Munro wrote: >>> On Tue, Jul 26, 2022 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> Hmm ... an alternative theory is that the test is fine, and what >>>> it's telling us is that get_dirent_type() is still wrong on msys. >>>> Would that end in this symptom? >>> Hmm, possibly yes (if it sees a non-symlink, it'll skip it). If >>> someone can run the test on an msys system, perhaps they could put a >>> debugging elog() into the code modified by 9d3444dc to log d_name and >>> the d_type that is returned? I'm struggling to understand why msys >>> would change the answer though. >> I have no idea either. The link exists and it is a junction. I'll see >> about logging details. > >From the clues so far, it seems like pgwin32_is_junction(fullpath) was > returning true, but the following code in get_dirent_type(), which was > supposed to be equivalent, is not reached on MSYS (though it > apparently does on MSVC?): > > + if ((fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0 && > + (fd.dwReserved0 == IO_REPARSE_TAG_MOUNT_POINT)) > d->ret.d_type = DT_LNK; > > pgwin32_is_junction() uses GetFileAttributes() and tests (attr & > FILE_ATTRIBUTE_REPARSE_POINT) == FILE_ATTRIBUTE_REPARSE_POINT, which > is like the first condition but lacks the dwReserved0 part. What is > that part doing, and why would it be doing something different in MSVC > and MSYS builds? That code came from 87e6ed7c, recently I was just > trying to fix it by reordering the checks; oh, there was some > discussion about that field[1]. > > One idea is that something about dwReserved0 or > IO_REPARSE_TAG_MOUNT_POINT is different in the open source replacement > system headers supplied by the MinGW project used by MSYS builds > (right?), compared to the "real" Windows SDK's headers used by MSVC > builds. > > Or perhaps there is some other dumb mistake, or perhaps the reparse > point really is different, or ... I dunno, I'd probably shove a load > of log messages in there and see what's going on. > > [1] https://www.postgresql.org/message-id/flat/CABUevEzURN%3DwC95JHvTKFJtEy0eY9rWO42yU%3D59-q8xSwm-Dug%40mail.gmail.com#ac54acd782fc849c0fe6c2c05db101dc dirent.c is not used on msys, only on MSVC. msys is apparently using opendir and friends supplied by the system. What it does if there's a junction I'll try to find out, but it appears that 5344723755 was conceived under a misapprehension about the behaviour of msys. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
andrew@dunslane.net2022-07-27T14:15:54+00:00
alvherre@alvh.no-ip.org2022-07-27T14:24:31+00:00
On 2022-Jul-27, Andrew Dunstan wrote: > The msys dirent.h doesn't have a d_type field at all in a struct dirent. > I can see a number of ways of dealing with this, but the simplest seems > to be just to revert 5344723755, at least for msys, along with a comment > about why it's necessary. Hmm, what other ways there are? I'm about to push a change that duplicates the get_dirent_type call pattern and I was happy about not having that #ifdef there. Not that it's critical, but ... -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
andrew@dunslane.net2022-07-27T14:32:29+00:00
On 2022-07-27 We 10:24, Alvaro Herrera wrote: > On 2022-Jul-27, Andrew Dunstan wrote: > >> The msys dirent.h doesn't have a d_type field at all in a struct dirent. >> I can see a number of ways of dealing with this, but the simplest seems >> to be just to revert 5344723755, at least for msys, along with a comment >> about why it's necessary. > Hmm, what other ways there are? I'm about to push a change that > duplicates the get_dirent_type call pattern and I was happy about not > having that #ifdef there. Not that it's critical, but ... The alternative I thought of would be to switch msys to using our dirent.c. Probably not too hard, but certainly more work than reverting. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com