Answered

Query metadata based on ENCODE accession IDs (DeepBlueR)

floris.barthel 9 months ago • updated by deepblue 9 months ago 7

Hi, I'm interested in using DeepBlue to fetch ENCODE metadata based on accession IDs.


E.g. ENCSR000AEH, ENCSR000AEF, ENCSR000AED


This can be done in the package ENCODExplorer but I could not find such features in DeepBlueR. https://www.bioconductor.org/packages/release/bioc/html/ENCODExplorer.html

Hello,

As DeepBlue is a multi-project data server (ENCODExplorer focuses only on ENCODE), we try to provide general solutions for listing and finding data across multiple projects.


Answering your question:


The ENCODE data imported into DeepBlue has the attribute 'accession' in its extra_metadata.

You can check it here (enter ENCODE in the "project" column):

http://deepblue.mpi-inf.mpg.de/dashboard.php#ajax/deepblue_view_experiments.php


What you can do:

Use the command [deepblue_]list_experiments, passing ENCODE as the project. This command returns a list of IDs and names.

For those IDs, execute the command [deepblue_]info(). This command returns the full metadata for the given IDs.
You can then filter the experiments by accession, using the 'accession' field in the extra_metadata of these experiments.
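
The filtering step described above can be sketched in base R as follows. This is only a minimal sketch: filter_by_accession is a hypothetical helper (not part of DeepBlueR), the mock records merely imitate the nested-list shape that info() returns, and the mock IDs and the non-matching accession are made up. Which field actually holds the accession in your data should be checked against real output.

```r
# Hypothetical helper (not a DeepBlueR function): keep the records whose
# extra_metadata field `field` matches one of the wanted accessions.
filter_by_accession <- function(records, wanted, field = "accession") {
  has_match <- vapply(records, function(rec) {
    acc <- rec$extra_metadata[[field]]
    # Records without the field are dropped rather than raising an error.
    !is.null(acc) && any(acc %in% wanted)
  }, logical(1))
  records[has_match]
}

# Mock records standing in for the output of deepblue_info(); with real data:
#   exps    <- deepblue_list_experiments(project = "ENCODE")
#   records <- deepblue_info(id = exps$id)
mock_records <- list(
  list(`_id` = "e1", name = "exp_a",
       extra_metadata = list(accession = "ENCSR000AEH")),
  list(`_id` = "e2", name = "exp_b",
       extra_metadata = list(accession = "ENCSR000ZZZ")),  # placeholder value
  list(`_id` = "e3", name = "exp_c", extra_metadata = list())
)

wanted <- c("ENCSR000AEH", "ENCSR000AEF", "ENCSR000AED")
hits <- filter_by_accession(mock_records, wanted)
```

Here hits contains only the first mock record; the third one is skipped because it has no accession at all.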


Let me know if this answers your question.


Thank you,
Felipe Albrecht


Hey Felipe,


Thanks for your response. I'm currently trying the following:


> experiments = deepblue_list_experiments()
Called method: deepblue_list_experiments
Reported status was: okay
> experiment_meta = deepblue_info(id = experiments$id)

The deepblue_info() command either hangs or is taking a very long time (I have been waiting for 30 minutes or so).


Thanks,

Floris

Hello,


When you execute deepblue_list_experiments() without any filter, it returns the IDs and names of all available experiments (almost 40k at this time).

So info() will return a huge XML document, which then has to be parsed by R, and that is quite slow.


I strongly suggest that you filter by the type of experiments that you want.

Examples:


DNA Methylation data

dna_methylation_exps = deepblue_list_experiments(project="ENCODE", epigenetic_mark="DNA Methylation")


H3K27ac peaks (bed files)

deepblue_list_experiments(project="ENCODE", epigenetic_mark="H3k27ac", type="peaks")


H3K27ac signal (signal files)

deepblue_list_experiments(project="ENCODE", epigenetic_mark="H3k27ac", type="signal")
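
If a filtered list is still long, one option (not something DeepBlueR requires) is to request the metadata in smaller batches, so that each response stays small and progress is visible. A minimal base R sketch follows; split_into_batches is a hypothetical helper, the IDs are made up, and whether batching is faster overall is an assumption to test on your own data.

```r
# Hypothetical helper: split a vector of IDs into consecutive batches.
split_into_batches <- function(ids, batch_size = 500) {
  stopifnot(batch_size >= 1)
  unname(split(ids, ceiling(seq_along(ids) / batch_size)))
}

# Made-up IDs standing in for experiments$id.
ids <- sprintf("e%05d", 1:1200)
batches <- split_into_batches(ids, batch_size = 500)
batch_sizes <- lengths(batches)

# With real data, each batch could then be requested separately, for example:
#   records <- list()
#   for (b in batches) records <- c(records, deepblue_info(id = b))
# (this assumes info() returns one list element per requested ID)
```

With 1200 IDs and a batch size of 500, this yields three batches of 500, 500 and 200 IDs.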



I hope this helps you.


Thanks! This works well. Any suggestions on processing the resulting info?


The function given in the tutorial does not work for me:


# Obtain the information about the experiment_id
info = deepblue_info(experiment_id)

# Print the experiment name, project, biosource, and epigenetic mark.
with(info, {
  data.frame(name = name, project = project,
             biosource = sample_info$biosource_name,
             epigenetic_mark = epigenetic_mark)
})

This returns an error.


I've also tried many different tidyverse options to convert it to a manageable nested data frame structure (e.g. combinations of flatten and unnest), but I'm not having any luck.


Solved: using purrr does the trick for me.

http://r4ds.had.co.nz/lists.html#hierarchy

https://jennybc.github.io/purrr-tutorial/ls01_map-name-position-shortcuts.html


If this is helpful to anyone, I used the following code:


> library(DeepBlueR)
> library(tidyverse)
> experiments = deepblue_list_experiments(project = "ENCODE")
> ## Note: the following step can take quite a while
> tmp = deepblue_info(id = experiments$id)
> ## .default = NA_character_ makes a missing field yield NA instead of an error
> ## (.default replaces the older .null argument in recent purrr versions)
> meta = tibble(experiment_id = map_chr(tmp, "_id"),
+               file_accession = map(tmp, "extra_metadata") %>% map_chr("file_encode_accession", .default = NA_character_),
+               sample_accession = map(tmp, "sample_info") %>% map_chr("accession", .default = NA_character_),
+               genome = map_chr(tmp, "genome"),
+               epigenetic_mark = map_chr(tmp, "epigenetic_mark"),
+               description = map_chr(tmp, "description"),
+               project = map_chr(tmp, "project"),
+               technique = map_chr(tmp, "technique"),
+               file_type = map(tmp, "extra_metadata") %>% map_chr("file_type", .default = NA_character_),
+               biosource_name = map(tmp, "sample_info") %>% map_chr("biosource_name", .default = NA_character_),
+               biosource_type = map(tmp, "sample_info") %>% map_chr("biosample_type", .default = NA_character_))
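
For readers without purrr, the same "return NA when a field is missing" idea can be sketched in base R. This is an illustration only: extract_chr is a hypothetical helper, and the mock records (IDs and field values included) are made up to imitate the nested lists used above.

```r
# Hypothetical helper: pull one character field out of a named sub-list of
# each record, returning NA when the sub-list or the field is absent.
extract_chr <- function(records, group, field) {
  vapply(records, function(rec) {
    value <- rec[[group]][[field]]
    if (is.null(value)) NA_character_ else as.character(value)[1]
  }, character(1))
}

# Mock records; the second and third are deliberately incomplete.
mock_records <- list(
  list(`_id` = "e1", sample_info = list(biosource_name = "K562"),
       extra_metadata = list(file_type = "bed")),
  list(`_id` = "e2", sample_info = list(),
       extra_metadata = list(file_type = "bigWig")),
  list(`_id` = "e3")
)

meta <- data.frame(
  experiment_id  = vapply(mock_records, function(rec) rec[["_id"]], character(1)),
  file_type      = extract_chr(mock_records, "extra_metadata", "file_type"),
  biosource_name = extract_chr(mock_records, "sample_info", "biosource_name"),
  stringsAsFactors = FALSE
)
```

The resulting data frame has one row per record, with NA wherever a record lacked the requested field, which is the role the default value plays in the purrr version.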

Glad that you found the answer and thanks for sharing it!