I'm guessing that reporting delays, such as those recently discussed, might underlie my confusion, laid out below.
With respect to graphs I'd posted about lags in reporting deaths:
shpalman wrote: ↑Wed Dec 23, 2020 9:22 pm
Not sure, the curves look like they're the wrong way around.
What I want is deaths-per-day versus date-of-death for each data set reported on a different day.
I drafted this rambling response yesterday and have just finished it.
So you want:
- x-axis: publication date
- y-axis: number of deaths newly published on this publication date
- colour: date-of-death.
- Numbers from different dates-of-death 'stacked' within publication date, so total 'height' at each publication date = total number of deaths newly published on this publication date. Line joins each date-of-death to corresponding values at adjacent publication dates.
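For illustration only, here is a minimal base-R sketch of that kind of stacked plot. The data frame `toy` and its numbers are made up, not real figures, and the sketch leaves out the lines joining each date-of-death across publication dates.

```r
# Toy data: hypothetical newly published counts, not real figures.
toy <- data.frame(
  date      = as.Date(c("2020-12-01", "2020-12-01", "2020-12-02", "2020-12-02")),
  Published = as.Date(c("2020-12-02", "2020-12-03", "2020-12-03", "2020-12-04")),
  diff      = c(5, 3, 4, 2)
)

# Rows = date-of-death, columns = publication date.
m <- xtabs(diff ~ date + Published, data = toy)

# One stacked bar per publication date, one colour per date-of-death.
barplot(m, col = seq_len(nrow(m)), legend.text = rownames(m),
        xlab = "Publication date", ylab = "Deaths newly published")

# Total height of each bar = total newly published on that publication date.
colSums(m)
```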
I've been saving published values of "Deaths within 28 days of positive test by date of death" (DateDeaths) (and, incidentally, cases by specimen date) from 17 November until today (27 December), 41 publication dates so far.
My immediate problem is that my data for each publication date doesn't have (and I don't think https://coronavirus.data.gov.uk/ gives) the newly published deaths versus date-of-death, just the totals at that publication date for each date-of-death. I addressed that by sorting the data by publication date within date-of-death and calculating successive differences for each date-of-death, e.g. below for date-of-death = 1 December. In this dataframe (dDF) I added the column 'Published' when saving the data, and the column 'diff' to do the required plot.
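A minimal sketch of that differencing step, assuming a data frame with the columns `date`, `DateDeaths` and `Published` as described. The four rows are the first four values for date-of-death 1 December, deliberately shuffled so the sort matters. Using `ave()` is just one way to do it and may not be how the original 'diff' column was built.

```r
# First four published totals for date-of-death 1 December, in shuffled order.
dDF <- data.frame(
  date       = as.Date("2020-12-01"),
  DateDeaths = c(325, 119, 346, 278),
  Published  = as.Date(c("2020-12-04", "2020-12-02", "2020-12-05", "2020-12-03"))
)

# Sort by publication date within date-of-death.
dDF <- dDF[order(dDF$date, dDF$Published), ]

# Within each date-of-death: first published total, then successive differences.
dDF$diff <- ave(dDF$DateDeaths, dDF$date, FUN = function(x) c(x[1], diff(x)))

dDF$diff  # 119 159 47 21
```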
Code: Select all
date DateDeaths Published diff
11203 2020-12-01 119 2020-12-02 119
11204 2020-12-01 278 2020-12-03 159
11205 2020-12-01 325 2020-12-04 47
11206 2020-12-01 346 2020-12-05 21
11207 2020-12-01 350 2020-12-06 4
11208 2020-12-01 352 2020-12-07 2
11209 2020-12-01 353 2020-12-08 1
11210 2020-12-01 364 2020-12-09 11
11211 2020-12-01 368 2020-12-10 4
11212 2020-12-01 379 2020-12-11 11
11213 2020-12-01 383 2020-12-12 4
11214 2020-12-01 386 2020-12-13 3
11215 2020-12-01 386 2020-12-14 0
11216 2020-12-01 389 2020-12-15 3
11217 2020-12-01 394 2020-12-16 5
11218 2020-12-01 394 2020-12-17 0
11219 2020-12-01 395 2020-12-18 1
11220 2020-12-01 397 2020-12-19 2
11221 2020-12-01 398 2020-12-20 1
11222 2020-12-01 398 2020-12-21 0
11223 2020-12-01 398 2020-12-22 0
11224 2020-12-01 399 2020-12-23 1
11225 2020-12-01 401 2020-12-24 2
11226 2020-12-01 399 2020-12-25 -2
11227 2020-12-01 402 2020-12-26 3
11228 2020-12-01 401 2020-12-27 -1
Two things jump out. Firstly, the number of deaths dated 1 December is still being changed at 27 December.
More surprising, there are some negative differences: on 25 Dec and 27 Dec the number decreased from the previous day. This isn't unusual. Of the 11,533 differences most are zero, as expected, and 1,167 are positive, but 329 are negative.
Code: Select all
> summary(subset(dDF, diff < 0)$diff)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-20.000 -2.000 -1.000 -1.617 -1.000 -1.000
I don't really know what to make of these. Maybe they're corrections of "blunders", but there are many of them, and they're quite widely spread.
Code: Select all
> unique(subset(dDF, diff < 0)$Published)
[1] "2020-11-21" "2020-12-09" "2020-12-12" "2020-11-19" "2020-11-24"
[6] "2020-11-25" "2020-11-30" "2020-12-06" "2020-12-10" "2020-12-18"
[11] "2020-12-23" "2020-12-15" "2020-11-29" "2020-12-01" "2020-12-19"
[16] "2020-12-27" "2020-12-24" "2020-12-08" "2020-12-13" "2020-11-27"
[21] "2020-12-20" "2020-11-22" "2020-12-16" "2020-12-22" "2020-12-25"
[26] "2020-12-26" "2020-11-20" "2020-11-18" "2020-12-03" "2020-12-14"
[31] "2020-12-05" "2020-12-07" "2020-12-17" "2020-12-04" "2020-12-21"
[36] "2020-11-23" "2020-11-28" "2020-12-11" "2020-12-02"
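To see how the negative differences are spread, rather than just which publication dates have any, one could count them per publication date. A small sketch, assuming the 'Published' and 'diff' columns described above; the rows in `toy` are hypothetical.

```r
# Toy rows with hypothetical values; the real dDF has the same two columns.
toy <- data.frame(
  Published = as.Date(c("2020-12-25", "2020-12-25", "2020-12-26", "2020-12-27")),
  diff      = c(-2, 1, 3, -1)
)

# Keep only the negative differences.
neg <- subset(toy, diff < 0)

# Number of negative differences on each publication date.
counts <- table(format(neg$Published))
counts
```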
Perhaps (I really don't know) these negative values are related to the mismatch noted below.
I had intended the 'diff' column in dataframe dDF above to be the newly published deaths versus date-of-death needed for the plot, giving me this:
(Attachment: plot.png)
The line colours differ by date-of-death. They look very similar because there are 298 dates-of-death and the most visible ones are quite similar.
But apart from that, there are some very funny features in this graph, especially in the totals at each publication date.
If I sum 'diff' across Published for each date-of-death I get the most recently published deaths by date-of-death (DateDeaths). That has to be the case, because the successive differences cancel in the sum, leaving only the latest value.
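A quick check of that cancellation, using the last five published values for date-of-death 1 December from the listing above and treating the first of them as the starting value (an assumption made only for this sketch):

```r
# Published totals for date-of-death 1 December, 23 to 27 December.
x <- c(399, 401, 399, 402, 401)

# First value, then successive differences (some are negative).
d <- c(x[1], diff(x))

# The differences sum to the most recent value, whatever happened in between.
sum(d) == x[length(x)]  # TRUE
```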
In an ideal world, if I had all the data back to day zero, and summed 'diff' across date-of-death for each Published I should get the most recently published "Deaths within 28 days of positive test by date reported" (PubDeaths). These are the cases when they disagree:
Code: Select all
> subset(bypub,sum.diff != PubDeaths) # mismatches
Pubdate sum.diff PubDeaths
1 2020-11-17 52744 598
2 2020-11-18 528 529
3 2020-11-19 347 501
4 2020-11-20 664 511
5 2020-11-21 219 341
6 2020-11-22 249 398
7 2020-11-23 478 206
8 2020-11-24 607 608
10 2020-11-26 25 498
11 2020-11-27 843 521
12 2020-11-28 496 479
13 2020-11-29 53 215
14 2020-11-30 494 205
15 2020-12-01 607 603
19 2020-12-05 300 397
20 2020-12-06 70 231
21 2020-12-07 446 189
22 2020-12-08 601 616
23 2020-12-09 531 533
24 2020-12-10 512 516
25 2020-12-11 397 424
26 2020-12-12 422 519
27 2020-12-13 30 144
28 2020-12-14 476 232
29 2020-12-15 390 506
30 2020-12-16 726 613
31 2020-12-17 533 532
33 2020-12-19 426 534
34 2020-12-20 120 326
35 2020-12-21 530 215
36 2020-12-22 689 691
37 2020-12-23 742 744
38 2020-12-24 455 585
39 2020-12-25 219 570
40 2020-12-26 47 230
41 2020-12-27 44 316
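A comparison like the one above could be built roughly as follows. This is a sketch with made-up numbers; `pub` is a hypothetical name for a data frame holding the "by date reported" figures, and `aggregate()` plus `merge()` is only one way to do it.

```r
# Toy inputs with hypothetical values.
dDF <- data.frame(
  Published = as.Date(c("2020-12-12", "2020-12-12", "2020-12-13", "2020-12-13")),
  diff      = c(10, 5, 3, -1)
)
pub <- data.frame(
  Pubdate   = as.Date(c("2020-12-12", "2020-12-13")),
  PubDeaths = c(15, 9)
)

# Sum 'diff' across dates-of-death for each publication date.
sums <- aggregate(diff ~ Published, data = dDF, FUN = sum)
names(sums) <- c("Pubdate", "sum.diff")

# Put the two series side by side and keep the publication dates that disagree.
bypub <- merge(sums, pub, by = "Pubdate")
mismatches <- subset(bypub, sum.diff != PubDeaths)
mismatches
```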
The first, very large, disagreement is expected: I don't have publication dates prior to 17 November, so all deaths up to then are assigned to that publication date, which I've omitted from the graph. But there are large disagreements later too.
I had hoped that a given date-of-death would become constant after 2 or 3 weeks, giving zero differences, no longer contributing to sums for publication dates. But, as an example, here are the non-zero contributions to the 13 December publication date.
Code: Select all
> subset(dDF, (Published == as.Date("2020-12-13")) & (diff != 0))
date DateDeaths Published diff
2897 2020-05-09 377 2020-12-13 -1
3307 2020-05-19 275 2020-12-13 1
3348 2020-05-20 267 2020-12-13 -1
9826 2020-10-25 249 2020-12-13 1
10400 2020-11-08 412 2020-12-13 -1
10605 2020-11-13 433 2020-12-13 -1
10687 2020-11-15 449 2020-12-13 -1
10728 2020-11-16 423 2020-12-13 -1
10879 2020-11-20 449 2020-12-13 -1
10949 2020-11-22 474 2020-12-13 -1
11047 2020-11-25 477 2020-12-13 -1
11077 2020-11-26 431 2020-12-13 1
11105 2020-11-27 415 2020-12-13 -1
11133 2020-11-28 453 2020-12-13 1
11188 2020-11-30 419 2020-12-13 1
11214 2020-12-01 386 2020-12-13 3
11284 2020-12-04 433 2020-12-13 3
11305 2020-12-05 347 2020-12-13 1
11326 2020-12-06 365 2020-12-13 1
11346 2020-12-07 370 2020-12-13 2
11383 2020-12-09 356 2020-12-13 5
11400 2020-12-10 313 2020-12-13 20
These go back to date-of-death 9 May. There doesn't seem to be a time after which the influence of a date-of-death on a sum for publication date disappears.
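To put a number on how far back the revisions reach, one could compute the lag in days between date-of-death and publication date for each non-zero difference. A sketch using three of the rows listed above (the column name `lag` is my own):

```r
# Three of the rows listed above for publication date 13 December.
rows <- data.frame(
  date      = as.Date(c("2020-05-09", "2020-11-08", "2020-12-10")),
  Published = as.Date("2020-12-13"),
  diff      = c(-1, -1, 20)
)

# Days between date-of-death and the publication date that changed it.
rows$lag <- as.numeric(rows$Published - rows$date)

rows$lag  # 218 35 3
```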
But that isn't enough to explain the differences. According to https://coronavirus.data.gov.uk/details ... itive-test, "Deaths by date reported - each death is assigned to the date when it was first included in the published totals", so a "Deaths by date reported" figure shouldn't include any deaths published earlier.
As an example, on 13 December the "Deaths by date reported" figure was 144. But look at the by date-of-death data published on 12 and 13 December: here are all the dates-of-death where the numbers differed between the two publication dates. There just aren't 144 deaths included on 13 December which weren't included earlier.
Code: Select all
> tDF <- data.frame(subset(dDF, (Published == as.Date("2020-12-12")), 1:3),
subset(dDF, (Published == as.Date("2020-12-13"))), check.names = TRUE)
> subset(tDF, diff !=0)
date DateDeaths Published date.1 DateDeaths.1 Published.1 diff
2896 2020-05-09 378 2020-12-12 2020-05-09 377 2020-12-13 -1
3306 2020-05-19 274 2020-12-12 2020-05-19 275 2020-12-13 1
3347 2020-05-20 268 2020-12-12 2020-05-20 267 2020-12-13 -1
9825 2020-10-25 248 2020-12-12 2020-10-25 249 2020-12-13 1
10399 2020-11-08 413 2020-12-12 2020-11-08 412 2020-12-13 -1
10604 2020-11-13 434 2020-12-12 2020-11-13 433 2020-12-13 -1
10686 2020-11-15 450 2020-12-12 2020-11-15 449 2020-12-13 -1
10727 2020-11-16 424 2020-12-12 2020-11-16 423 2020-12-13 -1
10878 2020-11-20 450 2020-12-12 2020-11-20 449 2020-12-13 -1
10948 2020-11-22 475 2020-12-12 2020-11-22 474 2020-12-13 -1
11046 2020-11-25 478 2020-12-12 2020-11-25 477 2020-12-13 -1
11076 2020-11-26 430 2020-12-12 2020-11-26 431 2020-12-13 1
11104 2020-11-27 416 2020-12-12 2020-11-27 415 2020-12-13 -1
11132 2020-11-28 452 2020-12-12 2020-11-28 453 2020-12-13 1
11187 2020-11-30 418 2020-12-12 2020-11-30 419 2020-12-13 1
11213 2020-12-01 383 2020-12-12 2020-12-01 386 2020-12-13 3
11283 2020-12-04 430 2020-12-12 2020-12-04 433 2020-12-13 3
11304 2020-12-05 346 2020-12-12 2020-12-05 347 2020-12-13 1
11325 2020-12-06 364 2020-12-12 2020-12-06 365 2020-12-13 1
11345 2020-12-07 368 2020-12-12 2020-12-07 370 2020-12-13 2
11382 2020-12-09 351 2020-12-12 2020-12-09 356 2020-12-13 5
11399 2020-12-10 293 2020-12-12 2020-12-10 313 2020-12-13 20
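The side-by-side comparison above relies on the two subsets lining up row for row. A possibly safer variant, sketched here, matches on date-of-death with `merge()`, so a date-of-death that appears only in the later publication is still counted. The values for 9 and 10 December are from the listing above; the 11 December row and the names `earlier`, `later`, `both` and `change` are hypothetical.

```r
# Totals by date-of-death as published on 12 and 13 December.
dDF <- data.frame(
  date       = as.Date(c("2020-12-09", "2020-12-10",
                         "2020-12-09", "2020-12-10", "2020-12-11")),
  DateDeaths = c(351, 293, 356, 313, 150),  # 150 is a made-up value
  Published  = as.Date(c("2020-12-12", "2020-12-12",
                         "2020-12-13", "2020-12-13", "2020-12-13"))
)

earlier <- subset(dDF, Published == as.Date("2020-12-12"), c(date, DateDeaths))
later   <- subset(dDF, Published == as.Date("2020-12-13"), c(date, DateDeaths))

# Match rows by date-of-death rather than by position.
both <- merge(earlier, later, by = "date", all = TRUE,
              suffixes = c(".old", ".new"))

# A date-of-death not yet published on the earlier date counts as zero.
both$DateDeaths.old[is.na(both$DateDeaths.old)] <- 0

# Net change per date-of-death, and the total across all dates-of-death.
both$change <- both$DateDeaths.new - both$DateDeaths.old
sum(both$change)
```

The total of `change` is what I would expect to compare against the "by date reported" figure for the later publication date, if my reading of the definition is right.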
Sorry for the loser-length post. I feel I must be misunderstanding something, probably something very simple - it has been known before! Can anyone put me straight?