CanGraph.ExposomeExplorer package#

Since the Exposome-Explorer (E-E) schema is confidential, only a representation of the graph (created using NetworkX in Python) can be shared.

This package, created as part of my Master’s Internship at IARC, transitions the exposome-explorer database (a high-quality, hand-curated database containing associations of foods and chemical compounds with cancer) to Neo4J format in an automated way, providing an export in GraphML format.

To run, it uses alive_progress to generate an interactive progress bar (which shows that the script is still running through its most time-consuming parts) and the neo4j python driver. These requirements can be installed using: pip install -r requirements.txt.

To run the script itself, use:

python3 main.py neo4jadress databasename databasepassword csvfolder

where:

  • neo4jadress: the URL of the database, in neo4j:// or bolt:// format

  • databasename: the name of the database in use. If using the free version, there will only be one database per project (neo4j being the default name); if using the pro version, you can specify an alternate name here

  • databasepassword: the password for the databasename database. Since the arguments are passed by Bash onto python3, you might need to escape special characters

  • csvfolder: the folder where the CSV files for the Exposome Explorer database are stored. These CSVs have to be manually exported from the (confidential) database itself, and are NOT equivalent to those found on Exposome-Explorer’s downloads page
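As a rough illustration of how these four positional arguments might be read, and of how a password containing special characters can be quoted for the shell, here is a minimal sketch using only the Python standard library. The `parse_cli` helper, the `localhost:7687` address and the sample password are hypothetical and not part of CanGraph; the actual main.py may simply read `sys.argv`.

```python
import argparse
import shlex


def parse_cli(argv):
    """Parse the four positional arguments described above (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("neo4jadress", help="database URL, neo4j:// or bolt://")
    parser.add_argument("databasename", help="name of the database in use")
    parser.add_argument("databasepassword", help="password for that database")
    parser.add_argument("csvfolder", help="folder holding the exported CSVs")
    args = parser.parse_args(argv)
    # Assumption: only the two URL schemes named in this README are accepted.
    if not args.neo4jadress.startswith(("neo4j://", "bolt://")):
        parser.error("neo4jadress must start with neo4j:// or bolt://")
    return args


# shlex.quote wraps a value so that Bash passes it to python3 unchanged.
password = "pa$$ word!"
command = "python3 main.py bolt://localhost:7687 neo4j {} ./csvs".format(
    shlex.quote(password)
)
print(command)

# Splitting the command the way a POSIX shell would recovers the password.
args = parse_cli(shlex.split(command)[2:])
print(args.databasepassword == password)
```

Quoting the password once, at the point where the command line is built, avoids having to escape each special character by hand.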

An archived version of this repository that takes into account the gitignored files can be created using: git archive HEAD -o ${PWD##*/}.zip


The package consists of the following modules:

CanGraph.ExposomeExplorer.build_database module#

A python module that provides the necessary functions to transition the Exposome Explorer database to graph format, either from scratch, importing all the nodes (as showcased in CanGraph.ExposomeExplorer.main), or on a case-by-case basis, to annotate existing metabolites (as showcased in CanGraph.main).

add_cancer_associations(filename)[source]#

Imports the ‘cancer_associations’ database as a relation between a given Cancer and a Measurement

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

add_components(filename)[source]#

Adds “Metabolite” nodes from Exposome-Explorer’s components.csv. This is because these components are, in fact, metabolites, either from food or from human metabolism.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

add_correlations(filename)[source]#

Imports the ‘correlations’ database as a relation between two measurements: the intake_id, a food taken by the organism and registered using dietary questionnaires, and the excretion_id, a chemical found in human biological samples, such that, when one takes the first component, one will excrete the other. Data comes from epidemiological studies where dietary questionnaires are administered and biomarkers are measured in specimens.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

add_measurements_stuff(filename)[source]#

A massive and slow-running function that creates ALL the relations between the ‘measurements’ table and all other related tables:

  • units: The units in which a given measurement is expressed

  • components: The component which is being measured

  • samples: The sample from which a measurement is taken

  • experimental_methods: The method used to take a measurement

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

add_metabolomic_associations(filename)[source]#

Imports the ‘metabolomic_associations’ database as a relation between two measurements: the intake_id, a food taken by the organism and registered using dietary questionnaires, and the excretion_id, a chemical found in human biological samples, such that, when one takes the first component, one will excrete the other. Data comes from metabolomics studies seeking to identify putative dietary biomarkers.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

add_microbial_metabolite_identifications(filename)[source]#

Imports the relations pertaining to the “microbial_metabolite_identifications” table. A component (i.e. a metabolite) can be identified as a Microbial Metabolite, which means it has an equivalent in the microbiome. This can have a given reference and a tissue (BioSpecimen) in which it occurs.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

add_reproducibilities(filename)[source]#

Creates relations between the “reproducibilities” and the “measurements” tables, using “initial_id”, an old identifier, for the linkage.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

add_samples(filename)[source]#

Imports the relations pertaining to the “samples” table. A sample will be taken from a given subject and a given tissue (that is, a specimen, which will be blood, urine, etc.).

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

add_subjects(filename)[source]#

Imports the relations pertaining to the “subjects” table. Basically, a subject can appear in a given publication, and will be part of a cohort (i.e. a group of subjects).

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

annotate_auto_units(filename)[source]#

Shows the correlations between two units, converted using the Ruby gem ‘https://github.com/masa16/phys-units’, which standardizes units of measurement for our data.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

annotate_cancers(filename)[source]#

Adds “Cancer” nodes from Exposome-Explorer’s cancers.csv

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

annotate_cohorts(filename)[source]#

Adds “Cohort” nodes from Exposome-Explorer’s cohorts.csv

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

annotate_experimental_methods(filename)[source]#

Adds “ExperimentalMethod” nodes from Exposome-Explorer’s experimental_methods.csv

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

annotate_measurements(filename)[source]#

Adds “Measurement” nodes from Exposome-Explorer’s measurements.csv

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

annotate_microbial_metabolite_info(filename)[source]#

Adds “Metabolite” nodes from Exposome-Explorer’s microbial_metabolite_identifications.csv. These represent all metabolites that have been re-identified as present, for instance, in the microbiome.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

annotate_publications(filename)[source]#

Adds “Publication” nodes from Exposome-Explorer’s publications.csv

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

annotate_reproducibilities(filename)[source]#

Adds “Reproducibility” nodes from Exposome-Explorer’s reproducibilities.csv. These represent the conditions under which a given study/measurement was carried out.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

annotate_samples(filename)[source]#

Adds “Sample” nodes from Exposome-Explorer’s samples.csv. From a Sample, one can take a series of measurements.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

annotate_specimens(filename)[source]#

Annotates “BioSpecimen” nodes from Exposome-Explorer’s specimens.csv whose ID is already present in the DB. A biospecimen is a type of tissue where a measurement can originate, such as urine, cerebrospinal fluid, etc.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

annotate_subjects(filename)[source]#

Annotates “Subject” nodes from Exposome-Explorer’s subjects.csv whose ID is already present in the DB.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

annotate_units(filename)[source]#

Adds “Unit” nodes from Exposome-Explorer’s units.csv. A unit can be converted into another (for example, for normalization).

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

build_from_file(databasepath, Neo4JImportPath, driver, bar=None, do_all=False, keep_counts_and_displayeds=True, keep_cross_properties=False)[source]#

A function able to build a portion of the Exposome-Explorer database in graph format, provided that at least one “Component” (Metabolite) node is present in said database. It works by using that node as a starting point from which to search the rest of the Exposome-Explorer database, finding related nodes there.

Parameters
  • databasepath (str) – The path to the database where all Exposome-Explorer CSVs are stored

  • Neo4JImportPath (str) – The path from which Neo4J is importing data

  • driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use

  • bar – The bar() object from alive_bar, in case we want the function to run with do_all=True

  • do_all (bool) – True if importing the whole database; False if just importing a part of it

  • keep_counts_and_displayeds (bool) – Whether to keep the properties ending with `_count` & `displayed_` that, although present in the original DB, might be considered not useful for us.

  • keep_cross_properties (bool) – Whether to keep the properties used to cross-reference in the original Neo4J database.

Returns

This function modifies the Neo4J Database as desired, but does not produce any particular return.

Note

This won’t work if a “Component” (Metabolite) node is not already present; when building the database, either in full or in parts, you should import the respective Components first.

Warning

Due to the script’s design, only nodes which have a connection to nodes previously present in the database will be imported. This is on purpose: unconnected nodes don’t mean much in a graph database.
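The connected-only behaviour described in the warning can be pictured with a small in-memory sketch. This is an illustration under stated assumptions, not the package’s implementation (which presumably performs the matching inside Neo4J): the `select_connected` helper, the `component_id` column name and the sample rows are all invented here.

```python
def select_connected(rows, foreign_key, known_ids):
    """Keep only the rows whose foreign key points at an already-imported node."""
    return [row for row in rows if row[foreign_key] in known_ids]


# Invented data: two components are already in the graph, three measurements exist.
known_components = {"10", "11"}
measurements = [
    {"id": "1", "component_id": "10"},
    {"id": "2", "component_id": "99"},
    {"id": "3", "component_id": "11"},
]

# The measurement that references the unknown component "99" is left out.
kept = select_connected(measurements, "component_id", known_components)
print([row["id"] for row in kept])
```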

import_csv(filename, label)[source]#

Imports a given CSV into Neo4J. This CSV must be present in Neo4J’s Import Path

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the CSV file that is being imported

  • label (str) – The label of the Neo4J nodes that will be imported, with the columns of the CSV being its properties.

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result

Note

For this to work, you HAVE TO have APOC available on your Neo4J installation.
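One plausible way to build such an import is to combine Cypher’s LOAD CSV clause with APOC’s `apoc.create.node` procedure, which accepts the label as a parameter. The sketch below is an assumption about the approach, not a copy of the package’s code: `import_csv_sketch` and `RecordingTx` are hypothetical names, and the stand-in transaction only records the call so the example runs without a database.

```python
def import_csv_sketch(tx, filename, label):
    """Create one node per CSV row, using the CSV columns as node properties."""
    # LOAD CSV reads from Neo4J's import directory through a file:/// URL;
    # apoc.create.node lets the label be supplied as a query parameter.
    query = (
        "LOAD CSV WITH HEADERS FROM ('file:///' + $filename) AS line "
        "CALL apoc.create.node([$label], line) YIELD node "
        "RETURN count(node) AS created"
    )
    return tx.run(query, filename=filename, label=label)


class RecordingTx:
    """Stand-in for a transaction that records calls instead of contacting Neo4J."""

    def __init__(self):
        self.calls = []

    def run(self, query, **parameters):
        self.calls.append((query, parameters))
        return None


tx = RecordingTx()
import_csv_sketch(tx, "cancers.csv", "Cancer")
print(tx.calls[0][1])
```

With the official driver, a function of this shape would typically be handed to a session’s managed-transaction method, which supplies `tx` as the first argument; that would explain why `tx` appears under Parameters but not in the signatures shown on this page.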

remove_counts_and_displayeds(inputfile, outputfile)[source]#

Removes `_count` & `displayed_` text-strings from a given file, so that, when processing it with the other functions present in this document, they ignore the columns containing said text-strings, which represent properties considered not useful for our program. This is, of course, not the most elegant solution, but it works.

Parameters
  • inputfile (str) – The path to the file from which `_count` & `displayed_` text-strings are to be removed

  • outputfile (str) – The path of the file where the contents of the replaced file will be written.

Returns

The function does not have a return; instead, it transforms `inputfile` into `outputfile`.
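Read literally, the description amounts to a substring replacement from one file into another. The following is a minimal sketch of that literal reading, assuming nothing beyond what is stated above; `strip_markers` and the sample column names are invented for illustration, and the real function may differ in its details.

```python
import os
import tempfile


def strip_markers(inputfile, outputfile, markers=("_count", "displayed_")):
    """Copy inputfile to outputfile with every marker substring removed."""
    with open(inputfile, "r", encoding="utf-8") as source, \
         open(outputfile, "w", encoding="utf-8") as target:
        for line in source:
            for marker in markers:
                line = line.replace(marker, "")
            target.write(line)


# Usage on a throwaway file; the column names below are invented.
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "raw.csv")
clean_path = os.path.join(workdir, "clean.csv")
with open(raw_path, "w", encoding="utf-8") as handle:
    handle.write("id,displayed_name,measurements_count\n1,Urine,12\n")

strip_markers(raw_path, clean_path)
with open(clean_path, encoding="utf-8") as handle:
    cleaned_header = handle.readline().strip()
print(cleaned_header)
```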

remove_cross_properties()[source]#

Removes some properties that were added by the other functions present in this script, that are used to cross-reference the different tables in the Relational Database EE comes from, and that, in a Graph Database, are no longer necessary.

Parameters

tx (neo4j.Session) – The session under which the driver is running

Returns

A Neo4J connexion to the database that modifies it accordingly.

Return type

neo4j.Result
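Dropping properties of this kind can be expressed with Cypher’s REMOVE clause. As a hedged sketch, the hypothetical helper below only builds the statement text for nodes; the property names passed to it are placeholders, since the actual list of cross-reference properties is defined in the module’s source and is not given here.

```python
def build_remove_query(property_names):
    """Return a Cypher statement that removes the given properties from all nodes."""
    if not property_names:
        raise ValueError("at least one property name is required")
    # Backticks quote each property name so unusual characters stay valid.
    targets = ", ".join("n.`{}`".format(name) for name in property_names)
    return "MATCH (n) REMOVE {}".format(targets)


# Hypothetical cross-reference columns, used purely for illustration.
query = build_remove_query(["component_id", "sample_id"])
print(query)
```

Removing a property that a node does not have is a no-op in Cypher, so a single statement of this shape can be applied to every node regardless of its label.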

CanGraph.ExposomeExplorer.main module#

A python module that leverages the functions present in the build_database module to recreate the exposome-explorer database using a graph format and Neo4J, and then provides a GraphML export file.

Please note that, to work, the functions here presuppose that you have access to Exposome-Explorer’s internal CSVs, and that you have placed them in the folder provided as `sys.argv[4]`. These CSVs are confidential, and can only be accessed upon request to the International Agency for Research on Cancer.

For more details on how to run this script, please consult the package’s README

main(args)[source]#

The function that executes the code