RE: [PATCH v3 1/4] Tools: hv: Reopen the devices if read() or write() returns errors

From: Michael Kelley
Date: Sat Jan 25 2020 - 23:49:46 EST


From: Dexuan Cui <decui@xxxxxxxxxxxxx> Sent: Saturday, January 25, 2020 11:54 AM
>
> The state machine in the hv_utils driver can run out of order in some
> corner cases, e.g. if the kvp daemon doesn't call write() fast enough
> due to some reason, kvp_timeout_func() can run first and move the state
> to HVUTIL_READY; next, when kvp_on_msg() is called it returns -EINVAL
> since kvp_transaction.state is smaller than HVUTIL_USERSPACE_REQ; later,
> the daemon's write() gets an error -EINVAL, and the daemon will exit().
>
> We can reproduce the issue by sending a SIGSTOP signal to the daemon, wait
> for 1 minute, and send a SIGCONT signal to the daemon: the daemon will
> exit() quickly.
>
> We can fix the issue by forcing a reset of the device (which means the
> daemon can close() and open() the device again) and doing extra necessary
> clean-up.
>
> Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx>
>
> ---
> Changes in v2:
> This is actually a new patch that makes the daemons more robust.
>
> Changes in v3 (I addressed Michael's comments):
> Don't reset target_fd, since that's unnecessary.
> Reset target_fname by: target_fname[0] = '\0';
> Added the missing "fs_frozen = true;" in vss_operate().
> After reopen_vss_fd: if vss_operate(VSS_OP_THAW) can not clear
> fs_frozen due to an error, we just exit().
> Added comments.
>
> tools/hv/hv_fcopy_daemon.c | 33 +++++++++++++++++++++----
> tools/hv/hv_kvp_daemon.c | 36 ++++++++++++++++------------
> tools/hv/hv_vss_daemon.c | 49 +++++++++++++++++++++++++++++---------
> 3 files changed, 88 insertions(+), 30 deletions(-)
>
> diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
> index aea2d91ab364..48cfa032432c 100644
> --- a/tools/hv/hv_fcopy_daemon.c
> +++ b/tools/hv/hv_fcopy_daemon.c
> @@ -80,6 +80,8 @@ static int hv_start_fcopy(struct hv_start_fcopy *smsg)
>
> error = 0;
> done:
> + if (error)
> + target_fname[0] = '\0';
> return error;
> }
>
> @@ -108,15 +110,29 @@ static int hv_copy_data(struct hv_do_fcopy *cpmsg)
> return ret;
> }
>
> +/*
> + * Reset target_fname to "" in the two below functions for hibernation: if
> + * the fcopy operation is aborted by hibernation, the daemon should remove the
> + * partially-copied file; to achieve this, the hv_utils driver always fakes a
> + * CANCEL_FCOPY message upon suspend, and later when the VM resumes back,
> + * the daemon calls hv_copy_cancel() to remove the file; if a file is copied
> + * successfully before suspend, hv_copy_finished() must reset target_fname to
> + * avoid that the file can be incorrectly removed upon resume, since the faked
> + * CANCEL_FCOPY message is spurious in this case.
> + */
> static int hv_copy_finished(void)
> {
> close(target_fd);
> + target_fname[0] = '\0';
> return 0;
> }
> static int hv_copy_cancel(void)
> {
> close(target_fd);
> - unlink(target_fname);
> + if (strlen(target_fname) > 0) {
> + unlink(target_fname);
> + target_fname[0] = '\0';
> + }
> return 0;
>
> }
> @@ -141,7 +157,7 @@ int main(int argc, char *argv[])
> struct hv_do_fcopy copy;
> __u32 kernel_modver;
> } buffer = { };
> - int in_handshake = 1;
> + int in_handshake;
>
> static struct option long_options[] = {
> {"help", no_argument, 0, 'h' },
> @@ -170,6 +186,10 @@ int main(int argc, char *argv[])
> openlog("HV_FCOPY", 0, LOG_USER);
> syslog(LOG_INFO, "starting; pid is:%d", getpid());
>
> +reopen_fcopy_fd:
> + /* Remove any possible partially-copied file on error */
> + hv_copy_cancel();

Since you have removed the calls to close(fcopy_fd) after a
pread() or pwrite() failure that were in v2 of the patch, I was
expecting to see

if (fcopy_fd != -1)
close(fcopy_fd)

here, like you've done with the kvp and vss code. And
remember to initialize fcopy_fd to -1. :-)

Michael

> + in_handshake = 1;
> fcopy_fd = open("/dev/vmbus/hv_fcopy", O_RDWR);
>
> if (fcopy_fd < 0) {
> @@ -196,7 +216,7 @@ int main(int argc, char *argv[])
> len = pread(fcopy_fd, &buffer, sizeof(buffer), 0);
> if (len < 0) {
> syslog(LOG_ERR, "pread failed: %s", strerror(errno));
> - exit(EXIT_FAILURE);
> + goto reopen_fcopy_fd;
> }
>
> if (in_handshake) {
> @@ -231,9 +251,14 @@ int main(int argc, char *argv[])
>
> }
>
> + /*
> + * pwrite() may return an error due to the faked CANCEL_FCOPY
> + * message upon hibernation. Ignore the error by resetting the
> + * dev file, i.e. closing and re-opening it.
> + */
> if (pwrite(fcopy_fd, &error, sizeof(int), 0) != sizeof(int)) {
> syslog(LOG_ERR, "pwrite failed: %s", strerror(errno));
> - exit(EXIT_FAILURE);
> + goto reopen_fcopy_fd;
> }
> }
> }