Please help me with R!

Post by **Fishnut** » Tue Feb 16, 2021 12:13 pm

I shall preface by saying I'm not very good with R or coding in general. I can usually just about get by because I'm not doing anything hugely complicated. This doesn't feel like a hugely complicated task but my googling has come to naught. I'm hoping someone is going to take one look at this, roll their eyes at how simple the solution is, and help me out.

My dataset comprises trace element analyses for a load of fish otoliths from a load of different sites. The dataset comprises multiple trace element measurements for each otolith providing a bit of a time series but for this analysis I'm just interested in the core (birth) and margin (death). I've also got 5 different location categories, with site being the most specific and the others are various management areas.

In the past I created separate spreadsheets for the core and margin data and another separate spreadsheet with the site means, but I'm trying to be better and work off a single spreadsheet, so need to create these datasets internally in R.

I've been able to create the core and margin datasets using subset easily enough but it's calculating the means where I'm having difficulty. I tried using melt which retained the location categories but when I cast to calculate the means I hit a snag - all the location categories except site (as I'm interested in the site means) are discarded. I need a way of calculating site means but retaining the other location category columns.

Are melt and cast the best way to do this, or is there another way I'm missing?

My melt/cast code is this:

Code: Select all

LC_core_melt <- melt(LC_core, id=c("Fish_Code", "Fisheries", "MEOW_Province", "MEOW_Ecoregion", "IMCRA_Bioregion", "Site")) 

LC_site_means <- cast(LC_core_melt, Site~variable, mean)

Post by **Bird on a Fire** » Tue Feb 16, 2021 3:29 pm

Hi fishnut!

Would you be able to post the head() of your dataset? It's a bit easier to visualise the data structure with a few lines of preview. It could also be handy to see the head() of the output you're getting, and perhaps an example of what you'd actually like.

I'm slightly struggling to understand the output you need. If I've got you correctly, you want to calculate one single mean for each site. Those sites are nested within broader location categories, so you'd like to be able to see something like this at the end:

Code: Select all

 region Site        mean
      A    1  0.04145835
      A    2  0.12611889
      B    3 -0.44337403
      B    4 -1.04513258

If so, the easiest thing might be to calculate the means as you're currently doing, and then use the join() function to add those means to your original dataset. Something like

Code: Select all

library(plyr)  # join() lives in plyr; dplyr has left_join(), inner_join() etc. instead
join(LC_core, LC_site_means)

should link them up by the columns that match, but it very much depends on the structure of your datasets.

You probably don't need to go through melting and casting if you just want a column of means. Within the plyr package are various functions ending in *ply, depending on the type of input you have and output you want. For this case, ddply takes a data.frame as input and gives a data.frame as output.

So

Code: Select all

ddply(LC_core, .(Site), function(x) {mean(x$element)})

will return a dataset giving the mean of the element column (not sure what it's really called!), which you should then be able to join back to the original LC_core data.frame.

I think.
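For anyone following along, here is a minimal sketch of that ddply() idea on made-up data. The toy_core data frame and its Region and Sr columns are invented stand-ins, not the real otolith data, and it assumes the plyr package is installed.

```r
library(plyr)  # provides ddply(), .() and summarise()

# Hypothetical stand-in for LC_core: one row per fish, Site nested in Region
toy_core <- data.frame(
  Fish_Code = c("F1", "F2", "F3", "F4", "F5"),
  Region    = c("A", "A", "A", "A", "B"),
  Site      = c("S1", "S1", "S2", "S2", "S3"),
  Sr        = c(1, 3, 2, 4, 5)
)

# One mean per site
site_means <- ddply(toy_core, .(Site), summarise, mean_Sr = mean(Sr))

# Grouping by Region as well keeps that column in the output; this only
# gives one row per site if each site belongs to exactly one region
site_means_with_region <- ddply(toy_core, .(Region, Site), summarise,
                                mean_Sr = mean(Sr))
print(site_means_with_region)
```

If the nesting really is clean (each site in exactly one region), the second call may be all that's needed, with no join afterwards.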

FWIW I'm probably a reasonably advanced R user, but I still spend about 90% of my time munging data about into the correct formats, and 10% on the actual analyses. This stuff isn't easy even if it feels like it should be basic, but over time you'll build up a big toolbox.

Post by **Fishnut** » Tue Feb 16, 2021 6:20 pm

Thanks for that. I tried the "join" function but it said it couldn't be found, even though I definitely have the package installed and active. Not sure what's going on there.

Anyway, yeah, it all did get a bit confusing. Sorry! So I'll try again.

This is my core data.
[Screenshot: Screenshot 2021-02-16 at 18.06.57.png]
So what I want to get is the means of each site, but still keeping all those other geographical descriptors as well. This is a screenshot from my means spreadsheet. I just want to recreate this in R from my core data if at all possible.
[Screenshot: Screenshot 2021-02-16 at 18.17.22.png]

Post by **Bird on a Fire** » Tue Feb 16, 2021 10:15 pm

Ok great, thanks! That's clear now.

So the problem here isn't with the calculation of means per se, it's linking the calculated means to the rest of the data. It's basically a database join problem, but luckily a pretty straightforward one I think.

About R not finding join(): it lives in the plyr package rather than dplyr (dplyr's equivalents are left_join() and friends), so R won't find it unless plyr is loaded. You can force R to look within a particular package by using a double colon (ooo errr) like this:

Code: Select all

plyr::join(this_thing, that_thing)

but if R is just being super weird I'm fairly sure the merge() function from base R does the same thing in this case, just slower which shouldn't matter unless you have a bazillion otoliths.

If you want a dataset with a row for each site, and preserving all the higher level geographical information, you need to make two data.frames and then join/merge them.

Dataframe 1 is the means you already have.

Dataframe 2 is a dataset with one line for each site, and all the other columns of interest, but without all the individual measurements. You can get this by doing a subset of your dataset that only has the columns you need. Then, you can use the duplicated() function to find all the extra site rows, because many sites will have multiple measurements:

Code: Select all

new_df_withoutAllThosePeskyDuplications <- new_df[!duplicated(new_df), ]

(The duplicated function finds all the duplicated rows in whatever it's given (so it also works on vectors, which is very handy), and then you're using the square brackets to subset the data.frame and the ! to remove whichever rows return TRUE (this time it is Boolean).)

Your new_df_withoutAllThosePeskyDuplications will then have one line for each site, and all the other columns you selected for the subset earlier (which as I understand it should all have unique values).
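A small sketch of that subset-then-deduplicate step, using base R only. The column names are borrowed from the melt call earlier in the thread, but toy_core and all its values are made up for illustration.

```r
# Hypothetical stand-in for the core data: one row per fish
toy_core <- data.frame(
  Fish_Code = c("F1", "F2", "F3", "F4", "F5"),
  Site      = c("S1", "S1", "S2", "S2", "S3"),
  Fisheries = c("North", "North", "North", "North", "South"),
  Sr        = c(1, 3, 2, 4, 5),
  stringsAsFactors = FALSE
)

# Keep only the location columns, dropping fish IDs and measurements
site_info <- toy_core[, c("Site", "Fisheries")]

# duplicated() flags every row that repeats an earlier row
is_repeat <- duplicated(site_info)
print(is_repeat)

# Negating the flags keeps the first occurrence of each site
site_info_unique <- site_info[!is_repeat, ]
print(site_info_unique)
```

This only collapses to one row per site if the location columns are identical for every fish at that site; a typo in one Fisheries entry would leave an extra row behind.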

To create the table of your dreams, all you need to do now is join them. The merge() and join() functions both work in the same way, appending the entries of the second table to the first table, according to the values in columns shared between the two tables. Your two data.frames should have exactly one column in common, Site, which will be used to join them.

Running this:

Code: Select all

joined_data <- merge(new_df_withoutAllThosePeskyDuplications, LC_site_means, by = "Site")

should give you a dataset of unique site details, with an additional column for the mean of whatever value.

If there are several values you want to take site means of, you can then repeat the process adding each new column of means to joined_data, and so on.
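As a rough end-to-end sketch in base R, the whole thing might look like the following. Note that it swaps in aggregate() to take the means of several element columns in one go, rather than repeating the process per column. The data, the Sr and Ba columns and the object names are all invented for the example.

```r
# Hypothetical core data: two element columns, two location columns
toy_core <- data.frame(
  Fish_Code     = c("F1", "F2", "F3", "F4", "F5"),
  Site          = c("S1", "S1", "S2", "S2", "S3"),
  Fisheries     = c("North", "North", "North", "North", "South"),
  MEOW_Province = c("P1", "P1", "P1", "P1", "P2"),
  Sr            = c(1, 3, 2, 4, 5),
  Ba            = c(10, 20, 30, 50, 40),
  stringsAsFactors = FALSE
)

element_cols  <- c("Sr", "Ba")
location_cols <- c("Site", "Fisheries", "MEOW_Province")

# Data frame 1: mean of every element column, one row per site
site_means <- aggregate(toy_core[element_cols],
                        by = list(Site = toy_core$Site),
                        FUN = mean)

# Data frame 2: one row per site with its location descriptors
# (unique() is shorthand for the !duplicated() subsetting)
site_info <- unique(toy_core[location_cols])

# Join the two on the shared Site column
joined_data <- merge(site_info, site_means, by = "Site")
print(joined_data)
```

Naming the key with by = "Site" is a deliberate choice here: without it, merge() joins on every column name the two tables share, which can go wrong quietly if an element column sneaks into both.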

I hope some of that makes sense? As it happens I've just been tackling almost exactly the same problem for a PhD analysis, but with network analysis instead of isotopes. (I had a cool isotope plan for my PhD, but dropped it because I couldn't get the samples)

I've sort-of deliberately not given all the exact code, because finding the answers on my own with some hints was how I learned, and it sounds like you're keen to build this as a skill long term. But also I'd be happy to go through the actual data, by email or zooming or just on here. I'm a bit of an R evangelist and I've had plenty of help with it, so I'm trying to pay it forward

Post by **Fishnut** » Tue Feb 16, 2021 10:18 pm

Thanks so much! That looks like it makes sense though my eyes are starting to go blurry so will take a proper look in the morning.

Post by **Bird on a Fire** » Tue Feb 16, 2021 10:30 pm

Good luck!

Post by **Fishnut** » Mon Feb 22, 2021 9:01 pm

Thanks to BoaF I was able to fix my problem - thank you so much!

I now have a new problem that I was hoping someone might be able to help with. I'm plotting multiple graphs on a single plot with a single legend, and I have copied code from the interwebs to be able to do this. I have no clue what most of it does, but it works. However, the legend font size is really quite small and nothing I do to adjust it works. I can adjust the individual graph titles, the multiplot graph title and the axes titles, but not the legend. I was hoping someone might be able to take a look at the code and spot where I can adjust the legend font size.

This is the theme for the individual graphs (the annotations are mine which is why they're so basic)

Code: Select all

my_theme <- list(theme(panel.border = element_blank(),                      # Remove panel border
                       panel.grid = element_blank(),                        # Remove panel grid lines
                       panel.background = element_blank(),                  # Remove background
                       axis.line = element_line(colour = "black"),          # Add axis lines
                       legend.key = element_rect(fill = "white"),           # Change the legend background to white
                       legend.title = element_blank(),                      # Remove legend title
                       plot.title = element_text(hjust = 0.5, size = 8),    # change the title position and font size
                       plot.subtitle = element_text(hjust = 0.5, size = 8), # change the subtitle position and font size
                       axis.title.x = element_text(size = 8),               # change the x-axis font size
                       axis.title.y = element_text(size = 8)))

And this is the code for the multiplot (it uses ggplot2), which I got from somewhere and don't understand:

Code: Select all

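library(ggplot2)    # ggplotGrob(), theme()
library(gridExtra)  # grid.arrange(), arrangeGrob()
library(grid)       # textGrob(), gpar(), unit(), unit.c()

grid_arrange_shared_legend <- function(plots=NULL,titleText=NULL,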
                                       titleFont=2,titleSize=20,titleSpace=0.05){
  if(is.null(plots)) stop("Needs the plots as a list here!") 
  g <- ggplotGrob(plots[[1]] + theme(legend.position="bottom"))$grobs
  legend <- g[[which(sapply(g, function(x) x$name) == "guide-box")]]
  lheight <- sum(legend$height)
  tt <- textGrob(titleText,gp=gpar(fontsize=titleSize,font=titleFont)) # alter the fontsize and font for title font
  theight <- unit(titleSpace,"npc") 
  grid.arrange(
    tt,
    do.call(arrangeGrob, lapply(plots, function(x)
      x + theme(legend.position="none"))),
    legend,
    ncol = 1,
    heights = unit.c(theight, unit(1, "npc") - lheight - theight, lheight))
}

Post by **sTeamTraen** » Mon Feb 22, 2021 10:24 pm

Try changing the line that says

Code: Select all

g <- ggplotGrob(plots[[1]] + theme(legend.position="bottom"))$grobs

to

Code: Select all

g <- ggplotGrob(plots[[1]] + theme(legend.position="bottom", legend.text=element_text(size=12)))$grobs

If that makes a difference, google the various things that you can do with element_text. For example you can say

Code: Select all

element_text(size=rel(0.5))

which will make the legend text half the height of the main theme text.
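To try the two options outside the multiplot function, a stand-alone sketch along these lines might help. It uses the built-in mtcars data purely as a placeholder, and the plot and object names are made up; it assumes ggplot2 is installed.

```r
library(ggplot2)

# Placeholder plot with a legend (mtcars ships with R)
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# Absolute size: legend labels set in 12 pt text
p_abs <- p + theme(legend.position = "bottom",
                   legend.text = element_text(size = 12))

# Relative size: legend labels scaled relative to the theme's base text size
p_rel <- p + theme(legend.position = "bottom",
                   legend.text = element_text(size = rel(0.5)))

print(p_abs)
print(p_rel)
```

Because the shared legend is pulled out of plots[[1]] inside the function, the legend.text setting has to reach that first plot before ggplotGrob() is called, which is why adding it on that line does the trick.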

Post by **Fishnut** » Mon Feb 22, 2021 11:29 pm

That works, thank you!!!!
