CanGraph.GraphifyHMDB package#

The Schema for the GraphifyHMDB package, shown on Neo4J browser

This script, created as part of my Master’s Internship at IARC, imports nodes from the Human Metabolome Database (a high-quality database containing a list of metabolites and proteins associated with different diseases) into Neo4J format in an automated way, providing an export in GraphML format.

To run, it uses alive_progress to generate an interactive progress bar (which shows that the script is still running through its most time-consuming parts) and the neo4j python driver. These requirements can be installed using: pip install -r requirements.txt.

To run the script itself, use:

python3 main.py neo4jadress username databasepassword

where:

  • neo4jadress: is the URL of the database, in neo4j:// or bolt:// format

  • username: the username for your neo4j instance. Remember, the default is neo4j

  • databasepassword: the password for your database. Since the arguments are passed by Bash onto python3, you might need to escape special characters
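As a rough illustration of the three positional arguments, the sketch below parses them with argparse. The helper name parse_cli and the use of argparse are assumptions made for this example; the real main.py may read its arguments differently. The last line uses shlex.quote to show one shell-safe way of writing a password that contains special characters (the password shown is made up).

```python
import argparse
import shlex


def parse_cli(argv):
    """Parse the three positional arguments (hypothetical helper, not part of the package)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("neo4jadress", help="database URL, in neo4j:// or bolt:// format")
    parser.add_argument("username", help="username for the neo4j instance (default: neo4j)")
    parser.add_argument("databasepassword", help="password for the database")
    return parser.parse_args(argv)


# Example values only; replace them with your own connection details.
args = parse_cli(["bolt://localhost:7687", "neo4j", "p@ss$word!"])
print(args.neo4jadress, args.username)

# shlex.quote returns the form in which the password can be typed in the shell
# so that characters such as $ or ! reach python3 unchanged.
print(shlex.quote("p@ss$word!"))
```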

Please note that there are two kinds of functions in the associated code: those that use python f-strings, whose text cannot be copied directly into Neo4J (for instance, double curly brackets have to be turned into single brackets), and those that use normal multi-line strings, whose text can. This is because f-strings allow for variable customization, while normal strings don't.
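The difference between the two kinds of strings can be seen in a minimal sketch. The function names and the query itself are invented for illustration and are not taken from the package; only the brace handling is the point.

```python
def cypher_with_fstring(label):
    # In an f-string, doubled braces {{ }} are emitted as literal { },
    # while single braces {label} are replaced by the Python variable.
    return f"MATCH (n:{label} {{Name: $name}}) RETURN n"


def cypher_plain():
    # A normal string is passed on verbatim, so it can be pasted into Neo4J as-is.
    return "MATCH (n:Metabolite {Name: $name}) RETURN n"


print(cypher_with_fstring("Metabolite"))
print(cypher_plain())
```

Both calls print the same statement, which is why the source text of an f-string function needs its doubled braces collapsed before being pasted into the Neo4J browser.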

An archived version of this repository that takes into account the gitignored files can be created using: git archive HEAD -o ${PWD##*/}.zip

Important Notices#

  • Please ensure you have internet access, enough space in your hard drive (around 5 GB) and read-write access in ./xmlfolder. The files needed to build the database will be stored there.

  • There are two kinds of high-level nodes stored in this database: “Metabolites”, which are individual compounds present in the Human Metabolome; and “Proteins”, which are normally enzymes and are related to one or multiple metabolites. There are different types of metabolites, but they were all imported in the same way; their origin can be told apart by the “Biospecimen” field on the corresponding “Concentration” nodes. You could run a query such as: MATCH (n:Metabolite)-[r:MEASURED_AT]-(c:Concentration) RETURN DISTINCT c.Biospecimen

  • Some XML tags have been intentionally not processed; for example, the tag seemed like too much info unrelated to our project, or the tags, which could be useful but seemed to only link to external DBs
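The query quoted in the notices above could be run from Python roughly as follows. This is a sketch: distinct_biospecimens is a hypothetical helper, and it only assumes an object with a run() method that yields records indexable by key, as a neo4j session does.

```python
QUERY = """
MATCH (n:Metabolite)-[r:MEASURED_AT]-(c:Concentration)
RETURN DISTINCT c.Biospecimen AS biospecimen
"""


def distinct_biospecimens(session):
    """Return the distinct Biospecimen values found on Concentration nodes."""
    return [record["biospecimen"] for record in session.run(QUERY)]


# Possible use with the official driver (requires a running database):
#   from neo4j import GraphDatabase
#   with GraphDatabase.driver(neo4jadress, auth=(username, databasepassword)) as driver:
#       with driver.session() as session:
#           print(distinct_biospecimens(session))
```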


The package consists of the following modules:

CanGraph.GraphifyHMDB.build_database module#

A python module that provides the necessary functions to transition the HMDB database to graph format, either from scratch, importing all the nodes (as showcased in CanGraph.GraphifyHMDB.main), or on a case-by-case basis, to annotate existing metabolites (as showcased in CanGraph.main).

add_biological_properties(filename)[source]#

Adds biological properties to existing “Metabolite” nodes based on XML files obtained from the HMDB website. In this case, only properties labeled as <biological_properties> are added.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

Another option would have been to auto-add all the properties, and name them using RETURN “Predicted ” + apoc.text.capitalizeAll(replace(kind, “_”, ” “)), value; however, this way we can select and not duplicate / overwrite values.

Todo

It would be nice to be able to distinguish between experimental and predicted properties

add_concentrations_abnormal(filename)[source]#

Creates “Concentration” nodes based on XML files obtained from the HMDB website. In this function, only metabolites that are labeled as “abnormal_concentration” are added.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

Here, an UNWIND clause is used instead of a FOREACH clause. This provides better performance, since, unlike FOREACH, UNWIND does not process rows with empty values

Warning

Using the CREATE row forces the creation of a Concentration node, even when some values might be missing. However, this means some bogus nodes could be added, which MUST be accounted for at the end of the DB-Creation process.

add_concentrations_normal(filename)[source]#

Creates “Concentration” nodes based on XML files obtained from the HMDB website. In this function, only metabolites that are labeled as “normal_concentration” are added.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

Here, an UNWIND clause is used instead of a FOREACH clause. This provides better performance, since, unlike FOREACH, UNWIND does not process rows with empty values

Warning

Using the CREATE row forces the creation of a Concentration node, even when some values might be missing. However, this means some bogus nodes could be added, which MUST be accounted for at the end of the DB-Creation process.

add_diseases(filename)[source]#

Creates “Publication” nodes based on XML files obtained from the HMDB website.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

Here, an UNWIND clause is used instead of a FOREACH clause. This provides better performance, since, unlike FOREACH, UNWIND does not process rows with empty values (and, logically, there should be no Publication if there is no Disease)

Note

Publications are created with a (m)-[r:CITED_IN]->(p) relation with Metabolite nodes. If one wants to find the Publication nodes related to a given Metabolite/Disease relation, one can use:

MATCH p=()-[r:RELATED_WITH]->()
  WITH split(r.PubMed_ID, ",") as pubmed
    UNWIND pubmed as find_this
    MATCH (p:Publication)
      WHERE p.PubMed_ID = find_this
RETURN p

add_experimental_properties(filename)[source]#

Adds properties to existing “Metabolite” nodes based on XML files obtained from the HMDB website. In this case, only properties labeled as <experimental_properties> are added.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

Another option would have been to auto-add all the properties, and name them using RETURN “Experimental ” + apoc.text.capitalizeAll(replace(kind, “_”, ” “)), value; however, this way we can select and not duplicate / overwrite values.
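To make the naming scheme in the note above concrete, here is a small Python stand-in for the Cypher expression. It assumes that apoc.text.capitalizeAll upper-cases the first letter of every word; property_name and the sample value “melting_point” are illustrative and not taken from the package.

```python
def property_name(prefix, kind):
    """Python stand-in for: prefix + apoc.text.capitalizeAll(replace(kind, "_", " "))."""
    words = kind.replace("_", " ").split(" ")
    # Upper-case only the first letter of each word, leaving the rest untouched.
    return prefix + " " + " ".join(word[:1].upper() + word[1:] for word in words)


print(property_name("Experimental", "melting_point"))
```

With auto-generated names like these, every tag found in the XML would become a property, which is the loss of control over duplicated or overwritten values that the note describes.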

Todo

It would be nice to be able to distinguish between experimental and predicted properties

add_gene_properties(filename)[source]#

Adds some properties to existing “Protein” nodes based on XML files obtained from the HMDB website. In this case, properties will mostly relate to the gene from which the protein originates.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

We are not creating “Gene” nodes (even though each protein comes from a given gene) because we believe not enough information is being given about them.

add_general_references(filename, type_of)[source]#

Creates “Publication” nodes based on XML files obtained from the HMDB website.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

Since not all nodes present a “PubMed_ID” field (which would be ideal to uniquely identify Publications, as the “Text” field is far more prone to typos and errors), nodes will be created using the “Authors” field. This means some duplicates might exist, which should be accounted for.

Note

Unlike the rest, here we are not only matching metabolites, but ALSO proteins. This is intentional.

add_go_classifications(filename)[source]#

Creates “Gene Ontology” nodes based on XML files obtained from the HMDB website. This relates each protein to some GO-Terms

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

add_metabolite_associations(filename)[source]#

Adds associations contained in the “protein” file, between proteins and metabolites.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

Like the “add_metabolite_associations” function, this creates non-directional relationships (m)-[r:ASSOCIATED_WITH]-(p); this helps duplicates be detected.

Note

The “ON CREATE SET” clause for the “Name” param ensures no overwriting

add_metabolite_references(filename)[source]#

Creates references for relations between Protein nodes and Metabolite nodes.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Warning

Unfortunately, Neo4J makes it really, really, really difficult to work with XML, and so, this time, an r.PubMed_ID list with the references could not be created. Nonetheless, I considered it useful to add this.

add_metabolites(filename)[source]#

Creates “Metabolite” nodes based on XML files obtained from the HMDB website, adding some essential identifiers and external properties.

See also

This way of working has been taken from William Lyon’s Blog

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

add_predicted_properties(filename)[source]#

Adds properties to existing “Metabolite” nodes based on XML files obtained from the HMDB website. In this case, only properties labeled as <predicted_properties> are added.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

Another option would have been to auto-add all the properties, and name them using RETURN “Predicted ” + apoc.text.capitalizeAll(replace(kind, “_”, ” “)), value; however, this way we can select and not duplicate / overwrite values.

Todo

It would be nice to be able to distinguish between experimental and predicted properties

add_protein_associations(filename)[source]#

Creates “Protein” nodes based on XML files obtained from the HMDB website.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

Unlike the “add_proteins” function, this creates Proteins based on info in the “Metabolite” files, not in the “Protein” files themselves. This could mean node duplication, but, hopefully, the MERGE by Accession will mean that these duplicates are caught.

add_protein_properties(filename)[source]#

Adds some properties to existing “Protein” nodes based on XML files obtained from the HMDB website. In this case, properties will mostly relate to the protein itself.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

The “signal_regions” and the “transmembrane_regions” properties were left out because, after a preliminary search, they were mostly empty

add_proteins(filename)[source]#

Creates “Protein” nodes based on XML files obtained from the HMDB website.

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

We are not creating “Gene” nodes (even though each protein comes from a given gene) because we believe not enough information is being given about them.

add_taxonomy(filename)[source]#

Creates “Taxonomy” nodes based on XML files obtained from the HMDB website. These represent the “kind” of metabolite we are dealing with (Family, etc)

Parameters
  • tx (neo4j.Session) – The session under which the driver is running

  • filename (str) – The name of the XML file that is being imported

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Note

It only creates relationships in the Kingdom -> Super Class -> Class -> Subclass direction, and from any node -> Metabolite. This means that, if any member of the Kingdom -> Super Class -> Class -> Subclass is absent, the line will be broken; hopefully in that case a new metabolite will come in to rescue and settle the relation!
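The broken-chain behaviour described in this note can be pictured with a toy model in Python. This is not the Cypher used by the function; taxonomy_edges and the sample names are invented, and the sketch only links consecutive levels when both are present.

```python
LEVELS = ["Kingdom", "Super Class", "Class", "Subclass"]


def taxonomy_edges(taxonomy):
    """Return (parent, child) pairs for consecutive levels that are both present."""
    edges = []
    for upper, lower in zip(LEVELS, LEVELS[1:]):
        if taxonomy.get(upper) and taxonomy.get(lower):
            edges.append((taxonomy[upper], taxonomy[lower]))
    return edges


# Made-up example values, used only to show the effect of a missing level.
complete = {"Kingdom": "K1", "Super Class": "S1", "Class": "C1", "Subclass": "SC1"}
missing_super_class = {"Kingdom": "K1", "Class": "C1", "Subclass": "SC1"}

print(taxonomy_edges(complete))
print(taxonomy_edges(missing_super_class))
```

In the second case only the Class -> Subclass link is produced, so the Kingdom stays disconnected until another metabolite supplies the missing Super Class.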

build_from_metabolite_file(newfile, driver)[source]#

A function able to build a portion of the HMDB database in graph format, provided that one “Metabolite” XML is supplied to it. These are downloaded separately from the website, as all the files that are not `hmdb_proteins.zip`, and can be presented either as the full file, or as a split version of it, with just one item per file (which is recommended due to memory limitations).

Parameters
  • newfile (str) – The path of the XML file to import

  • driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use

Returns

This function modifies the Neo4J Database as desired, but does not produce any particular return.
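One possible way to produce the one-item-per-file version mentioned above is sketched below with the standard library. The helper split_xml_items and the output file names are assumptions for illustration, not the package's own splitter, and the sketch writes each item as its own root element without re-wrapping it in the original root.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_xml_items(source, out_dir):
    """Write every direct child of the XML root to its own file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    depth = 0
    root = None
    # iterparse streams the file, so the whole document never sits in memory.
    for event, elem in ET.iterparse(str(source), events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first element opened is the root
            depth += 1
        else:
            depth -= 1
            if depth == 1:  # a direct child of the root has just been closed
                target = out_dir / f"item_{len(written):06d}.xml"
                ET.ElementTree(elem).write(str(target), encoding="utf-8", xml_declaration=True)
                written.append(target)
                root.remove(elem)  # drop the processed item to free memory
    return written
```

Each resulting path could then be passed, one at a time, as the newfile argument.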

build_from_protein_file(newfile, driver)[source]#

A function able to build a portion of the HMDB database in graph format, provided that one “Protein” XML is supplied to it. These are downloaded separately from the website, as `hmdb_proteins.zip`, and can be presented either as the full file, or as a split version of it, with just one item per file (which is recommended due to memory limitations).

Parameters
  • newfile (str) – The path of the XML file to import

  • driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use

Returns

This function modifies the Neo4J Database as desired, but does not produce any particular return.

CanGraph.GraphifyHMDB.main module#

A python module that leverages the functions present in the build_database module to recreate the HMDB database using a graph format and Neo4J, and then provides a GraphML export file.

Please note that, to work, the functions here presuppose that you have internet access, which will be used to download HMDB’s XMLs under `./xmlfolder/` (please ensure you have read-write access there).

For more details on how to run this script, please consult the package’s README

main()[source]#

The function that executes the code