shpalman wrote: Wed Dec 30, 2020 9:12 am
What I noticed, looking at the raw data which KAJ kindly sent me, was that not all reports had data all the way up to the day before the report date. This seems to have happened every weekend in fact. For example, the data from the 4th of December reports data up to the 3rd, but the data on the 5th and 6th also stops on the 3rd, then the data on the 7th goes all the way up to the 6th. This is probably that issue I mentioned previously, that if not all four nations in the UK report a number for the date, the UK total just isn't given, and they are filled in after the weekend. Those numbers count as being reported on the 5th or 6th even if the UK doesn't find out about them until the 7th.
 
Yes. As downloaded, a single day's data is in "wide" format, e.g. some of the columns I downloaded yesterday:
Code:
> aDF[1:20, c(1,10:12)]
         date PubDeaths DateDeaths cumDateDeaths
1  2020-12-28       357         NA            NA
2  2020-12-27       316         NA            NA
3  2020-12-26       230         NA            NA
4  2020-12-25       570         NA            NA
5  2020-12-24       585         NA            NA
6  2020-12-23       744         NA            NA
7  2020-12-22       691        456         69818
8  2020-12-21       215        505         69362
9  2020-12-20       326        437         68857
10 2020-12-19       534        424         68420
11 2020-12-18       489        448         67996
12 2020-12-17       532        495         67548
13 2020-12-16       613        404         67053
14 2020-12-15       506        447         66649
15 2020-12-14       232        471         66202
16 2020-12-13       144        406         65731
17 2020-12-12       519        432         65325
18 2020-12-11       424        436         64893
19 2020-12-10       516        447         64457
20 2020-12-09       533        427         64010
This has one row per date for which there is any data at all, so a column with no data for that date holds a missing value (NA).
By the way, NA and zero should often (usually!) be treated differently. In the case of "Deaths by date-of-death" (DateDeaths) that is debatable. I read this example as meaning that no deaths have (yet!) been reported for 23-28 December. But it is recognised that DateDeaths figures may be incomplete and are subject to change in future publications, so if NA means "none yet" then NA is effectively zero. But that's by the way.
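To make the NA-versus-zero point concrete, here is a minimal base R sketch using the DateDeaths values for 20-25 December from the table above. The variable names are mine and purely illustrative.

```r
# DateDeaths for 2020-12-20 .. 2020-12-25, copied from the table above.
# The last three dates have no figure yet, so they are NA.
DateDeaths <- c(437, 505, 456, NA, NA, NA)

# Treating NA as "unknown": a sum that touches an NA is itself NA.
total_unknown <- sum(DateDeaths)

# Treating NA as "none yet" (i.e. zero): drop the NAs before summing.
total_none_yet <- sum(DateDeaths, na.rm = TRUE)

# The same thing done explicitly, replacing each NA by zero.
as_zero <- ifelse(is.na(DateDeaths), 0, DateDeaths)
```

Which of the two totals is appropriate depends on which reading of NA you adopt; the code only shows that the choice changes the answer.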
Moving on, I store the accumulating data (and sent it to you) in a "stacked" or "long" 3-column format. The first two columns are as published, but I omitted rows with NA for DateDeaths (as they hold no information). I added a column "Published" recording which day's download the data came from. This layout makes it easy (inter alia) to calculate how long after a given date-of-death a given number of deaths was reached: lag = Published - date.
Code:
           date DateDeaths  Published
2400 2020-12-06         53 2020-12-07
3102 2020-12-05        142 2020-12-07
4104 2020-12-04        231 2020-12-07
5104 2020-12-03        346 2020-12-07
4103 2020-12-03        337 2020-12-06
3101 2020-12-03        294 2020-12-05
<snip>
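As a rough sketch of the lag calculation, the six rows shown above can be re-entered by hand and the lag computed directly in base R. The data frame name longDF and the column name lag are my own choices here, not names from the stored data.

```r
# The six rows shown above, typed in as a long-format data frame.
longDF <- data.frame(
  date       = as.Date(c("2020-12-06", "2020-12-05", "2020-12-04",
                         "2020-12-03", "2020-12-03", "2020-12-03")),
  DateDeaths = c(53, 142, 231, 346, 337, 294),
  Published  = as.Date(c("2020-12-07", "2020-12-07", "2020-12-07",
                         "2020-12-07", "2020-12-06", "2020-12-05"))
)

# Days between the date of death and the publication giving this count.
# Subtracting two Date vectors yields a difftime in days.
longDF$lag <- as.integer(longDF$Published - longDF$date)
```

For 3 December, for instance, the rows with lag 2, 3 and 4 hold the counts 294, 337 and 346, which is the growth of that day's figure over successive publications.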
Your comments are most easily interpreted by looking at an "unstacked" or "wide" format, with data from different publication dates in different columns. This requires NA values (subject to my 'by the way' above, but also noting that where date-of-death > publication date the value really is NA), e.g.
Code:
        date 2020-12-07 2020-12-06 2020-12-05 2020-12-04
1 2020-12-06         53         NA         NA         NA
2 2020-12-05        142         NA         NA         NA
3 2020-12-04        231         NA         NA         NA
4 2020-12-03        346        337        294        119
5 2020-12-02        335        328        320        280
6 2020-12-01        352        350        346        325
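One way to get from the long layout to a wide one like this is tapply() in base R, which leaves NA in any date/publication combination that has no row. This is only a sketch using the six long-format rows quoted earlier (so it has no 2020-12-04 publication column), and the object names are illustrative; dates are kept as character strings to keep the row and column labels simple.

```r
# The six long-format rows quoted earlier, with dates as plain strings.
longDF <- data.frame(
  date       = c("2020-12-06", "2020-12-05", "2020-12-04",
                 "2020-12-03", "2020-12-03", "2020-12-03"),
  DateDeaths = c(53, 142, 231, 346, 337, 294),
  Published  = c("2020-12-07", "2020-12-07", "2020-12-07",
                 "2020-12-07", "2020-12-06", "2020-12-05"),
  stringsAsFactors = FALSE
)

# One row per date of death, one column per publication date.
# Each cell has at most one value, so sum() just passes it through;
# combinations with no row are left as NA.
wide <- tapply(longDF$DateDeaths,
               list(date = longDF$date, Published = longDF$Published),
               sum)

# Newest dates first, to match the layout above.
wide <- wide[order(rownames(wide), decreasing = TRUE),
             order(colnames(wide), decreasing = TRUE)]
```

The NA cells produced this way are exactly the "date-of-death > publication date" cases, plus any date that a given publication simply did not cover.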
You say "if not all four nations in the UK report a number for the date, the UK total just isn't given, and they are filled in after the weekend. Those numbers count as being reported on the 5th or 6th even if the UK doesn't find out about them until the 7th". 
But I'm afraid that contradicts the definition of "Deaths by date reported", which is:
"Deaths by date reported - each death is assigned to the date when it was first included in the published totals."
The numbers count as being reported on the date they are published, which cannot be before the UK finds out about them. If they are included in the "Deaths by date reported" I cannot see why they are not included in the "Deaths by date of death".
As I understand it, each death has (inter alia) three attributes:
- Location
- Date of death
- Date of first publication (possibly blank for a time)
Location controls which collections (UK, England, Midlands, ...) include the death. 
If the display is "Deaths by date of death" the summary is by that attribute, but may (should) exclude those with no date of first publication.
If the display is "Deaths by date reported" the summary is by date of first publication.
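To illustrate what I mean, here is a small base R sketch with entirely made-up records, one row per death, carrying the three attributes listed above. It is my own illustration of the single-store model, not a description of how coronavirus.data.gov.uk actually holds its data, and all names and values in it are hypothetical.

```r
# Made-up records for illustration only: one row per death.
deaths <- data.frame(
  location        = c("England", "England", "Wales", "Scotland", "England"),
  date_of_death   = as.Date(c("2020-12-03", "2020-12-03", "2020-12-03",
                              "2020-12-04", "2020-12-04")),
  first_published = as.Date(c("2020-12-04", "2020-12-05", "2020-12-07",
                              "2020-12-07", NA))  # NA = not yet published
)

# Both displays exclude deaths that have not been published yet.
published <- deaths[!is.na(deaths$first_published), ]

# "Deaths by date of death": summarise by the date_of_death attribute.
by_date_of_death <- table(format(published$date_of_death))

# "Deaths by date reported": summarise by the first_published attribute.
by_date_reported <- table(format(published$first_published))
```

Because both summaries are drawn from the same published records, their grand totals must agree; only the allocation of deaths to dates differs.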
It seems almost as if coronavirus.data.gov.uk holds each death separately, in a separate database for each display format. That is such bad practice that I would find it hard to believe if I hadn't worked with PHE.