shpalman wrote: ↑Wed Dec 30, 2020 9:12 am
One thing I noticed, looking at the raw data which KAJ kindly sent me, was that not all reports had data all the way up to the day before the report date. This seems to have happened every weekend, in fact. For example, the data from the 4th of December reports data up to the 3rd, but the data on the 5th and 6th also stops on the 3rd, then the data on the 7th goes all the way up to the 6th. This is probably that issue I mentioned previously, that if not all four nations in the UK report a number for the date, the UK total just isn't given, and they are filled in after the weekend. Those numbers count as being reported on the 5th or 6th even if the UK doesn't find out about them until the 7th.
Yes. As downloaded, a single day's data is in "wide" format, e.g. some of the columns I downloaded yesterday:
> aDF[1:20, c(1,10:12)]
         date PubDeaths DateDeaths cumDateDeaths
1  2020-12-28       357         NA            NA
2  2020-12-27       316         NA            NA
3  2020-12-26       230         NA            NA
4  2020-12-25       570         NA            NA
5  2020-12-24       585         NA            NA
6  2020-12-23       744         NA            NA
7  2020-12-22       691        456         69818
8  2020-12-21       215        505         69362
9  2020-12-20       326        437         68857
10 2020-12-19       534        424         68420
11 2020-12-18       489        448         67996
12 2020-12-17       532        495         67548
13 2020-12-16       613        404         67053
14 2020-12-15       506        447         66649
15 2020-12-14       232        471         66202
16 2020-12-13       144        406         65731
17 2020-12-12       519        432         65325
18 2020-12-11       424        436         64893
19 2020-12-10       516        447         64457
20 2020-12-09       533        427         64010
This has one row for each date for which there is any data, with missing values (NA) in any column that has no data for that date.
By the way, NA and zero should often (usually!) be treated differently. In the case of "Deaths by date-of-death" (DateDeaths) that is debatable. I read this example as meaning that no deaths have (yet!) been reported for 23-28 December. But it is recognised that DateDeaths figures may be incomplete and are subject to change in future publications, so if NA means "none yet" then NA effectively means zero. But that's by the way.
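To make the NA-versus-zero point concrete, here is a small sketch using the first few DateDeaths values from the table above. The variable names are mine, purely for illustration. For a total it makes no difference whether NA is skipped or counted as zero, but for an average it matters a great deal.

```r
# First five DateDeaths values from the table above (newest first).
DateDeaths <- c(NA, NA, 456, 505, 437)

total_na   <- sum(DateDeaths)                 # NA: missing values propagate
total_skip <- sum(DateDeaths, na.rm = TRUE)   # 1398: NA rows are skipped

# Treating NA as "none yet", i.e. zero (an assumption, as discussed above).
as_zero <- ifelse(is.na(DateDeaths), 0, DateDeaths)

mean_skip <- mean(DateDeaths, na.rm = TRUE)   # 466: average over 3 reported days
mean_zero <- mean(as_zero)                    # 279.6: average over all 5 days

print(c(total_skip = total_skip, mean_skip = mean_skip, mean_zero = mean_zero))
```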
Moving on, I store the accumulating data (and sent it to you) in a "stacked" or "long" 3-column format. The first two columns are as published, but I omitted rows with NA for DateDeaths (as they hold no information). I added a column "Published" recording which day's download the data came from. This layout makes it easy (inter alia) to calculate how long after a given date-of-death the given number of deaths was reached: lag = Published - date.
           date DateDeaths  Published
2400 2020-12-06         53 2020-12-07
3102 2020-12-05        142 2020-12-07
4104 2020-12-04        231 2020-12-07
5104 2020-12-03        346 2020-12-07
4103 2020-12-03        337 2020-12-06
3101 2020-12-03        294 2020-12-05
<snip>
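A minimal sketch of how such a long table could be built and the lag calculated, assuming each day's download is a data frame with `date` and `DateDeaths` columns. The helper `stack_download` and the two small data frames are hypothetical, made up from figures quoted in this post, and are not my actual script.

```r
# Hypothetical helper: stack one day's download into the long layout.
stack_download <- function(wide, published) {
  keep <- !is.na(wide$DateDeaths)   # rows with NA hold no information
  data.frame(
    date       = wide$date[keep],
    DateDeaths = wide$DateDeaths[keep],
    Published  = rep(as.Date(published), sum(keep))
  )
}

# Two cut-down downloads, using values that appear in this post.
d06 <- data.frame(
  date       = as.Date(c("2020-12-05", "2020-12-04", "2020-12-03")),
  DateDeaths = c(NA, NA, 337)
)
d07 <- data.frame(
  date       = as.Date(c("2020-12-06", "2020-12-05", "2020-12-04", "2020-12-03")),
  DateDeaths = c(53, 142, 231, 346)
)

long <- rbind(stack_download(d07, "2020-12-07"),
              stack_download(d06, "2020-12-06"))

# Days between the date of death and the publication that gave this count.
long$lag <- as.integer(long$Published - long$date)
print(long)
```

Subtracting two Date columns gives a difference in days, so the lag needs no further date handling.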
Your comments are most easily interpreted by looking at an "unstacked" or "wide" format with data from different publication dates in different columns. This requires NA values (subject to my 'by the way' above, but also noting that a date-of-death later than the publication date really is NA), e.g.
        date 2020-12-07 2020-12-06 2020-12-05 2020-12-04
1 2020-12-06         53         NA         NA         NA
2 2020-12-05        142         NA         NA         NA
3 2020-12-04        231         NA         NA         NA
4 2020-12-03        346        337        294        119
5 2020-12-02        335        328        320        280
6 2020-12-01        352        350        346        325
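For anyone wanting to reproduce this layout, here is a rough base R sketch that unstacks the six long rows shown earlier and then flags publications whose newest date of death is more than one day old. It is one possible approach using `tapply`, not necessarily how I produced the table above, and the object names are illustrative.

```r
# The six long-format rows shown earlier in this post.
long <- data.frame(
  date       = as.Date(c("2020-12-06", "2020-12-05", "2020-12-04",
                         "2020-12-03", "2020-12-03", "2020-12-03")),
  DateDeaths = c(53, 142, 231, 346, 337, 294),
  Published  = as.Date(c("2020-12-07", "2020-12-07", "2020-12-07",
                         "2020-12-07", "2020-12-06", "2020-12-05"))
)

# One cell per (date, Published) pair; pairs absent from `long` stay NA.
wide <- tapply(long$DateDeaths,
               list(date = format(long$date),
                    Published = format(long$Published)),
               sum)

# Newest dates and publications first, as in the table above.
wide <- wide[rev(rownames(wide)), rev(colnames(wide)), drop = FALSE]
print(wide)

# Smallest lag per publication: 1 means data up to the day before,
# anything larger means that publication stopped short.
min_lag <- tapply(as.integer(long$Published - long$date),
                  format(long$Published), min)
print(min_lag[min_lag > 1])
```

On these rows the publications of the 5th and 6th both stop at the 3rd, which is the weekend pattern you describe.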
You say "if not all four nations in the UK report a number for the date, the UK total just isn't given, and they are filled in after the weekend. Those numbers count as being reported on the 5th or 6th even if the UK doesn't find out about them until the 7th".
But I'm afraid that contradicts the definition of "Deaths by date reported" which is
Deaths by date reported - each death is assigned to the date when it was first included in the published totals.
The numbers count as being reported on the date they are published, which cannot be before the UK finds out about them. If they are included in the "Deaths by date reported" I cannot see why they are not included in the "Deaths by date of death".
As I understand it, each death has (inter alia) three attributes:
- Location
- Date of death
- Date of first publication (possibly blank for a time)
Location controls which collections (UK, England, Midlands, ...) include the death.
If the display is "Deaths by date of death" the summary is by that attribute, but may (should) exclude those with no date of first publication.
If the display is "Deaths by date reported" the summary is by date of first publication.
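As a sketch of that model, assuming a single table with one row per death: the rows below are invented purely for illustration (they are not real data) and the column names are my own. Both displays are then just different summaries of the same records, so their totals must agree.

```r
# Invented example records: one row per death, three attributes each.
deaths <- data.frame(
  location        = c("England", "England", "Wales", "Scotland", "England"),
  date_of_death   = as.Date(c("2020-12-03", "2020-12-03", "2020-12-04",
                              "2020-12-04", "2020-12-05")),
  first_published = as.Date(c("2020-12-04", "2020-12-07", "2020-12-07",
                              "2020-12-05", NA))   # NA: not yet published
)

# Only deaths that have been published appear in either display.
published <- deaths[!is.na(deaths$first_published), ]

# "Deaths by date of death": summarise by the date-of-death attribute.
by_date_of_death <- table(format(published$date_of_death))

# "Deaths by date reported": summarise by date of first publication.
by_date_reported <- table(format(published$first_published))

print(by_date_of_death)
print(by_date_reported)

# Location selects the collection, e.g. England only.
england <- published[published$location == "England", ]
print(nrow(england))
```

With one underlying table, a death cannot be counted in one display and missing from the other.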
It seems almost as if coronavirus.data.gov.uk hold each death separately in separate databases for each display format. That is such bad practice that I would find it hard to believe if I hadn't worked with PHE.