CanGraph package

deploy_webdocs(docs_folder='./docs/', work_dir='.', prechecks_done=False, custom_domain=None)[source]#

Generates the HTML web docs and publishes them to both GitHub and Codeberg Pages

Parameters

docs_folder (str) – The path to sphinx’s docs folder, where the tests will be run; by default ./docs/
work_dir (str) – The current Working Directory; by default .
prechecks_done (bool) – Whether the prechecks present in ~CanGraph.deploy.make_sphinx_prechecks have already been made
custom_domain (str) – A custom domain to deploy the docs to.

Returns

Whether the prechecks have already been done; always True if the function is run

Return type

bool

Note

For custom_domain to work, please configure your DNS records appropriately

Note

modules.rst is not removed, but it is correctly ignored in conf.py

git_push(path_to_repo, remote_names, commit_message, force=False)[source]#

Pushes the current repo’s state and current branch to a remote git repository

Parameters

path_to_repo (str) – The path to the local .git folder
remote_names (list or str) – The name(s) of the remote(s) to push to, which must be previously configured (see CanGraph.setup.setup_git); e.g.: [“github”, “codeberg”]
commit_message (str) – The Git Commit Message for the current repo’s state
force (bool) – Whether to force the push (necessary if you are resetting the HEAD)

Note

gitpython is not good at managing complex commit messages (i.e. those with a Subject and a Body). If you want to add one of those, please, use \n as the separator; the function will take care of the rest

See also

The approach taken here was inspired by StackOverflow #41836988
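
The note above asks callers to separate a commit Subject and Body with \n. As a minimal sketch of what that separation could look like, assuming a hypothetical helper named split_commit_message (it is not part of CanGraph, and the real function may process the message differently):

```python
def split_commit_message(commit_message):
    """Split a commit message into (subject, body) at the first newline."""
    # Hypothetical helper: the first line is treated as the Subject and
    # everything after the first "\n" as the Body.
    subject, _, body = commit_message.partition("\n")
    return subject.strip(), body.strip()
```

A message without any \n simply yields an empty Body, so single-line messages need no special handling.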

main()[source]#: The function that executes the code

make_sphinx_prechecks(docs_folder='./docs/', work_dir='.', gen_apidocs=False)[source]#

Generates sphinx api-docs for automatic documentation and uses make linkcheck to check for broken links

Parameters

gen_apidocs (bool) – Whether to re-generate the API docs. Default is False since we use MD (we don’t want RST files)
docs_folder (str) – The path to sphinx’s docs folder, where the tests will be run; by default ./docs/
work_dir (str) – The current Working Directory; by default .

CanGraph.main module#

A python module that leverages the functions present in the miscelaneous module and all other subpackages to annotate metabolites using a graph format and Neo4J, and then provides a GraphML export file.

CanGraph.main Usage#

To use this module:

A python utility to study and analyse cancer-associated metabolites using knowledge graphs

usage: python3 main.py [-h] [-c] [-n] [-s] [-w] [-i] --query QUERY
                       [--dbfolder DBFOLDER] [--results RESULTS]
                       [--adress ADRESS] [--username USERNAME]
                       [--password PASSWORD]

Named Arguments#

-c, --check_args: Checks if the rest of the arguments are OK, then exits
-n, --noindex: Runs the program checking each file one-by-one, instead of using a JSON index
-s, --similarity: Deactivates the import of information based on Structural Similarity. This might dramatically increase processing time; default is True.
-w, --webdbs: Activates import of information based on web databases. This might dramatically increase processing time; default is True.
-i, --interactive: tells the script if it wants interaction from the user and more information shown to them; similar to --verbose
--query: The location of the CSV file in which the program will search for metabolites
--dbfolder: The folder indicated to `setup.py` as the one where your databases will be stored; default is ./DataBases
--results: The folder where the resulting GraphML exports will be stored; default is ./Results
--adress: the URL of the database, in neo4j:// or bolt:// format
--username: the username of the neo4j database in use
--password: the password for the neo4j database in use. NOTE: Since passed through bash, you may need to escape some chars

You may find more info in the package’s README.

Note

For this program to work, the Git environment has to be set up first. You can ensure this by using: CanGraph.setup.setup_git

CanGraph.main Functions#

This module is comprised of:

add_mesh_and_metanetx(driver)[source]#

Adds MeSH Term IDs, Synonym relations and Protein interactions to existing nodes using MeSH and MetaNetX. Also adds KEGG Pathway IDs

Parameters: driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
Returns: This function modifies the Neo4J Database as desired, but does not produce any particular return.

annotate_using_wikidata(driver)[source]#

Once we finish the search, we annotate the nodes added to the database using WikiData

Parameters: driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
Returns: This function modifies the Neo4J Database as desired, but does not produce any particular return.

Todo

When fixing queries, fix the main subscript also

args_parser()[source]#

Parses the command line arguments into a more usable form, providing help and more

Returns: A dictionary of the different possible options for the program as keys, specifying their set value. If no command-line arguments are provided, the help message is shown and the program exits.
Return type: argparse.ArgumentParser

Note

Note that, in Google Docstrings, if you want a multi-line Returns comment, you have to start it in a different line :(

Note

The return must be of type argparse.ArgumentParser for the argparse directive to work and auto-gen docs

Note

By using argparse.const instead of argparse.default, the check_file function will check “” (the current dir, always exists) if the arg is not provided, not breaking the function; if it is, it checks it.

build_from_file(filepath, Neo4JImportPath, driver)[source]#

Imports a given metabolite from a single-metabolite file by checking its type and calling the appropriate import functions.

Parameters

filepath (str) – The path to the file that will be imported
Neo4JImportPath (str) – The path which Neo4J will use to import data
driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use

Returns

This function does not provide a particular return, but rather imports the requested file

Note

The filepath may be absolute or relative, but it is transformed to a relative relpath in order to remove possible influence of higher-name folders in the import type selection. This is also why the condition is stated as a big “if/elif/else” instead of a series of “ifs”

find_reasons_to_import_all_files(filepath, similarity, chebi_ids, names, hmdb_ids, inchis, mesh_ids)[source]#

Finds reasons to import a metabolite given a candidate filepath with one metabolite per file and a series of lists containing all synonyms of the values considered reasons for import

Parameters

filepath (str) – The path to the file in which we will search for reasons to import
similarity (bool) – Whether to use similarity as a measure to import or not
chebi_ids (list) – A list of all the ChEBI_ID which are considered a reason to import
names (list) – A list of all the Name which are considered a reason to import
hmdb_ids (list) – A list of all the HMDB_ID which are considered a reason to import
inchis (list) – A list of all the InChI which are considered a reason to import
mesh_ids (list) – A list of all the MeSH_ID which are considered a reason to import

Returns

A list of the methods that turned out to be valid for import, such as Name, ChEBI_ID…

Return type

list

find_reasons_to_import_inchi(query, subject)[source]#

Takes two strings and checks whether the query is present in the subject, or whether there are molecules common between them with at least 95% similarity

Parameters

query (str or list) – A string or list of strings describing valid InChI(s)
subject (str) – A valid InChI

Returns

A dict with each query as a key and the reason to import it as value, if there is one.

Return type

dict

See also

This approach was taken from Chemistry StackExchange #82144

Note

Since this is a one-to-one comparison, subject and query can be used interchangeably; however, bear in mind that only the query can be provided as a list
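
As a rough sketch of the str-or-list handling and the returned dict described above, the following covers only the textual containment check; the 95% structural-similarity comparison is deliberately left out. The function name and the "InChI" reason label are assumptions made for illustration, not CanGraph's actual values:

```python
def find_exact_inchi_matches(query, subject):
    """Return {query: reason} for every query string found inside subject."""
    # Accept either a single string or a list of strings as the query.
    queries = [query] if isinstance(query, str) else list(query)
    reasons = {}
    for candidate in queries:
        # Only textual containment is checked in this sketch; the
        # structural-similarity comparison is intentionally omitted.
        if candidate and candidate in subject:
            reasons[candidate] = "InChI"  # hypothetical reason label
    return reasons
```

Queries that do not match are simply absent from the returned dict, so an empty dict means there was no reason to import.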

import_based_on_all_files(all_files, Neo4JImportPath, driver, similarity, chebi_ids, names, hmdb_ids, inchis, mesh_ids)[source]#

A function that searches inside a series of lists, provided as arguments, and imports the metabolites matching those present in them iterating over a list of files which may contain relevant information to be imported

Parameters

all_files (list) – A list of all the possible files where we want to look for info
Neo4JImportPath (str) – The path which Neo4J will use to import data
driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
similarity (bool) – Whether to use similarity as a measure to import or not
chebi_ids (list) – A list of all the ChEBI_ID which are considered a reason to import
names (list) – A list of all the Name which are considered a reason to import
hmdb_ids (list) – A list of all the HMDB_ID which are considered a reason to import
inchis (list) – A list of all the InChI which are considered a reason to import
mesh_ids (list) – A list of all the MeSH_ID which are considered a reason to import

import_based_on_index(databasefolder, Neo4JImportPath, driver, similarity, chebi_ids, names, hmdb_ids, inchis, mesh_ids)[source]#

A function that searches inside a series of lists, provided as arguments, and imports the metabolites matching those present in them using a JSON file to map the bits of the databases where the relevant information lies

Parameters

databasefolder (str) – The main folder where all the databases we will be using are to be found. There must be an index.json file located in databasefolder/index.json
Neo4JImportPath (str) – The path which Neo4J will use to import data
driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
similarity (bool) – Whether to use similarity as a measure to import or not
chebi_ids (list) – A list of all the ChEBI_ID which are considered a reason to import
names (list) – A list of all the Name which are considered a reason to import
hmdb_ids (list) – A list of all the HMDB_ID which are considered a reason to import
inchis (list) – A list of all the InChI which are considered a reason to import
mesh_ids (list) – A list of all the MeSH_ID which are considered a reason to import

improve_search_terms(driver, chebi_ids, names, hmdb_ids, inchis, mesh_ids)[source]#

Improves the search terms already provided to the CanGraph programme by processing the text strings and finding synonyms in various platforms

Parameters

driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
chebi_ids (str) – A string of “;” separated values of all the ChEBI_ID representing the current metabolite
names (list) – A string of “;” separated values of all the Name representing the current metabolite
hmdb_ids (list) – A string of “;” separated values of all the HMDB_ID representing the current metabolite
inchis (list) – A string of “;” separated values of all the InChI representing the current metabolite
mesh_ids (list) – A string of “;” separated values of all the MeSH_ID representing the current metabolite

Returns

A list containing [ chebi_ids, names, hmdb_ids, inchis, mesh_ids ], with all their synonyms

Return type

list
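
The parameters above are described as “;” separated strings. A minimal sketch of how one such string could be turned into a clean list before looking for synonyms, assuming a hypothetical helper named split_search_terms (the real function may normalise its inputs differently):

```python
def split_search_terms(raw_terms):
    """Turn a ';'-separated string into a list of unique, trimmed terms."""
    unique_terms = []
    for term in raw_terms.split(";"):
        term = term.strip()
        # Skip empty fragments and duplicates, keeping first-seen order.
        if term and term not in unique_terms:
            unique_terms.append(term)
    return unique_terms
```

Keeping the first-seen order makes the result deterministic, which is convenient when the same query file is processed more than once.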

improve_search_terms_with_cts(query, query_type, chebi_ids, names, hmdb_ids, inchis, mesh_ids)[source]#

Improves the search terms already provided to the CanGraph programme by using The Chemical Translation Service to find synonyms in IDs

Parameters

query (str) – The term we are currently querying for
query_type (str) – The kind of query to search; one of [“ChEBI_ID”, “HMDB_ID”, “Name”, “InChI”, “MeSH_ID”]
driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
chebi_ids (str) – A string of “;” separated values of all the ChEBI_ID representing the current metabolite
names (list) – A string of “;” separated values of all the Name representing the current metabolite
hmdb_ids (list) – A string of “;” separated values of all the HMDB_ID representing the current metabolite
inchis (list) – A string of “;” separated values of all the InChI representing the current metabolite
mesh_ids (list) – A string of “;” separated values of all the MeSH_ID representing the current metabolite

Returns

A list containing [ chebi_ids, names, hmdb_ids, inchis, mesh_ids ], with all their synonyms

Return type

list

improve_search_terms_with_metanetx(query, query_type, driver, chebi_ids, names, hmdb_ids, inchis, mesh_ids)[source]#

Improves the search terms already provided to the CanGraph programme by using the MetaNetX web service to find synonyms in IDs

Parameters

query (str) – The term we are currently querying for
query_type (str) – The kind of query to search; one of [“ChEBI_ID”, “HMDB_ID”, “Name”, “InChI”, “MeSH_ID”]
driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
chebi_ids (str) – A string of “;” separated values of all the ChEBI_ID representing the current metabolite
names (list) – A string of “;” separated values of all the Name representing the current metabolite
hmdb_ids (list) – A string of “;” separated values of all the HMDB_ID representing the current metabolite
inchis (list) – A string of “;” separated values of all the InChI representing the current metabolite
mesh_ids (list) – A string of “;” separated values of all the MeSH_ID representing the current metabolite

Returns

A list containing [ chebi_ids, names, hmdb_ids, inchis, mesh_ids ], with all their synonyms

Return type

list

link_to_original_data(item_type, item, import_based_on)[source]#

Links a recently-imported metabolite to the original data (that which caused it to be imported) by creating an OriginalMetabolite node that is (n)-[r:ORIGINALLY_IDENTIFIED_AS]->(a) related to the imported data

Parameters

tx (neo4j.Session) – The session under which the driver is running
item_type (str) – The property to match in the Neo4J DataBase
item (dict) – The value of property `item_type`
import_based_on (list) – A list of the methods that turned out to be valid for import, such as Name, ChEBI_ID…

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

main()[source]#: The function that executes the code

Note

This function disables rdkit’s log messages, since rdkit seems to dislike the way some of the InChI strings it is getting from the databases are formatted

Todo

RENAME THE MESH NODES TO INDICATE THEIR TYPE. ADD NAME TO THE WIKIDATA NODES

Todo

FIX THE REPEAT TRANSACTION FUNCTION

Todo

Match partial InChI based on DICE-MACCS

Todo

MAKE IT WORK -> CURRENTLY, THIS SECTION SLOWS THINGS DOWN A LOT

Todo

CHECK APOC IS INSTALLED

Todo

FIX MAIN

Todo

MERGE BY INCHI, METANETX ID

Todo

Fix find_protein_interactions_in_metanetx

Todo

Move that function from setup to misc

Todo

EDIT conf.py

Todo

Document the following Schema Changes:
* For Subject, we have a composite PK: Exposome_Explorer_ID, Age, Gender and Information
* Now, more diseases will have a WikiData_ID and a related MeSH. This will help with networking. And these diseases don’t even need to be a part of a cancer!
* The Gene nodes no longer exist in the full db? -> They do

CanGraph.miscelaneous module#

A python module that provides a collection of functions to be used across the different scripts present in the CanGraph package, with various, useful functionalities

call_db_schema_visualization()[source]#

Shows the DB Schema. This function is intended to be run only in Neo4J’s console, since it produces no output when called from the driver.

Parameters: tx (neo4j.Session) – The session under which the driver is running

Todo

Make it download the image

check_file(filepath)[source]#

Checks for the presence of a file or folder. If it exists, it returns the filepath; if it doesn’t, it raises an argparse.ArgumentTypeError, which tells argparse how to process file exclusion

Note

Perhaps it’s not ideal, but I will be using this also to check for file existence throughout the CanGraph project, although the error type might not be correct

Parameters: filepath (str) – The path of the file or folder whose existence is being checked
Returns: The original filepath, which now is sure to exist
Return type: str
Raises: argparse.ArgumentTypeError – If the file does not exist
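
The behaviour described above can be sketched as follows. This is an illustration written from the description, not the package's actual source; the name check_file_sketch and the error message are assumptions:

```python
import argparse
import os


def check_file_sketch(filepath):
    """Return filepath if it exists; otherwise raise ArgumentTypeError."""
    # os.path.exists is True for both files and folders.
    if not os.path.exists(filepath):
        raise argparse.ArgumentTypeError(f"Missing file or folder: {filepath}")
    return filepath
```

Because the error is an argparse.ArgumentTypeError, such a function can be passed as the type= argument of ArgumentParser.add_argument, in which case argparse reports the problem as a normal usage error.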

check_neo4j_protocol(string)[source]#

Checks that a given string starts with any of the protocols accepted by the neo4j.Driver

Parameters: string (str) – A string, which will normally represent the neo4j address
Returns: The same string that was provided as an argument (required by argparse.ArgumentParser)
Return type: str
Raises: argparse.ArgumentTypeError – If the string is not of the correct protocol
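
A sketch of such a protocol check is shown below. The list of schemes is the set of URI schemes accepted by the official neo4j Python driver; whether CanGraph accepts all of them is an assumption, as are the name check_neo4j_protocol_sketch and the error message:

```python
import argparse

# URI schemes understood by the official neo4j Python driver.
ACCEPTED_SCHEMES = ("bolt", "bolt+s", "bolt+ssc", "neo4j", "neo4j+s", "neo4j+ssc")


def check_neo4j_protocol_sketch(string):
    """Return string unchanged if it starts with an accepted scheme."""
    if not any(string.startswith(scheme + "://") for scheme in ACCEPTED_SCHEMES):
        raise argparse.ArgumentTypeError(f"Unsupported neo4j address: {string}")
    # argparse expects the (possibly converted) value to be returned.
    return string
```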

clean_database()[source]#

A CYPHER query that gets all the nodes in a Neo4J database and removes them, in transactions of 100 rows to alleviate memory load

Returns: A text chain that represents the CYPHER query with the desired output. This can be run using: neo4j.Session.run
Return type: str

Note

This is an autocommit transaction. This means that, in order to not keep data in memory (and to make running it with a huge amount of data more efficient), you will need to add `:auto ` when calling it from the Neo4J browser, or call it using neo4j.Session.run from the driver.
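
For illustration, a function of this kind could return a query such as the one below. This is a sketch assuming a Neo4J version that supports CALL { ... } IN TRANSACTIONS (4.4 or later); the query text actually used by CanGraph may differ:

```python
def clean_database_sketch():
    """Return a CYPHER query that deletes every node in batches of 100."""
    # DETACH DELETE also removes the relationships attached to each node.
    # IN TRANSACTIONS only works in an autocommit (implicit) transaction.
    return """
        MATCH (n)
        CALL { WITH n DETACH DELETE n }
        IN TRANSACTIONS OF 100 ROWS
        """
```

The returned string is what would then be handed to neo4j.Session.run, as the note above explains.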

connect_to_neo4j(port='bolt://localhost:7687', username='neo4j', password='neo4j')[source]#

A function that establishes a connection to the neo4j server and returns a Driver into which transactions can be passed

Parameters

port (str) – The URL where the database is available to be queried. It must be of bolt:// format
username (str) – the username for your neo4j database; by default, neo4j
password (str) – the password for your database; by default, neo4j

Returns

An instance of Neo4J’s Bolt Driver that can be used

Return type

neo4j.Driver

Note

Since this is a really short function, this doesn’t really simplify the code that much, but it makes it much more re-usable and understandable

countlines(start, header=True, lines=0, begin_start=None)[source]#

A function that counts all the lines of code present in a given directory; useful to show off in Sphinx Docs

Parameters

start (str) – The directory from which to start the line counting
header (bool) – whether to print a header, or not
lines (int) – Number of lines already counted; do not fill, only for recursion
begin_start (str) – The subdirectory currently in use; do not fill, only for recursion

Returns

The number of lines present in start

Return type

int

See also

This function was taken from StackOverflow #38543709
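
A non-recursive variant of the same idea can be sketched with os.walk. It assumes that only .py files count as code and it does not print a header; both choices, and the name count_python_lines, are assumptions made for this example:

```python
import os


def count_python_lines(start):
    """Count the lines of every .py file found under the start directory."""
    total = 0
    for root, _dirs, files in os.walk(start):
        for name in files:
            if name.endswith(".py"):
                path = os.path.join(root, name)
                # errors="ignore" keeps odd encodings from stopping the count.
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total += sum(1 for _ in handle)
    return total
```

Using os.walk removes the need for the lines and begin_start bookkeeping arguments, at the cost of not reporting per-folder subtotals.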

create_n10s_graphconfig()[source]#

A CYPHER query that creates a neosemantics (n10s) constraint to hold all the RDF we will import.

Parameters: tx (neo4j.Session) – The session under which the driver is running
Returns: A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
Return type: neo4j.Result

See also

More information on this approach can be found in Neosemantics’ 101 Guide and in Neo4J’s guide on how to import data from Wikidata, where this approach was taken from

Deprecated since version 0.9: Since we are importing based on apoc.load.jsonParams, this is not needed anymore

download(url, folder)[source]#

Downloads a file from the internet into a given folder

Parameters

url (str) – The Uniform Resource Locator for the file to be downloaded
folder (str) – The folder under which the file will be stored.

Returns

The path where the file we just downloaded has been stored

Return type

str

download_and_untargz(url, folder)[source]#

Downloads and unzips a given tar.gz from the internet

Parameters

url (str) – The Uniform Resource Locator for the tar.gz to be downloaded and unzipped
folder (str) – The folder under which the file will be stored.

Returns

This function downloads and unzips the file in the desired folder, but does not produce any particular return.

download_and_unzip(url, folder)[source]#

Downloads and unzips a given Zipfile from the internet; useful for databases which provide zip access.

Parameters

url (str) – The Uniform Resource Locator for the Zipfile to be downloaded and unzipped
folder (str) – The folder under which the file will be stored.

Returns

This function downloads and unzips the file in the desired folder, but does not produce any particular return.

See also

Code snippets for this function were taken from Shyamal Vaderia’s Github and from StackOverflow #32123394

export_graphml(exportname)[source]#

Exports a Neo4J graph to GraphML format. The graph will be exported to Neo4JImportPath

Parameters

exportname (str) – The name for the exported file, which will be saved under ./Neo4JImportPath/

Returns

A Neo4J connexion to the database that exports the file, using batch optimizations and smaller batch sizes to try to keep the impact on memory use low

Return type

Note

For this to work, you HAVE TO have APOC available on your Neo4J installation

get_import_path(driver)[source]#

A function that runs an autocommit transaction to get Neo4J’s Import Path

Note

By doing the Neo4JImportPath search this way (in two functions), we are able to run the query as an execute_read, which, unlike autocommit transactions, allows the query to be better controlled, and repeated in case it fails.

Parameters: driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
Returns: Neo4J’s Import Path, i.e., where Neo4J will pick up files to be imported using the `file:///` schema
Return type: str

import_graphml(importname)[source]#

Imports a GraphML file into a Neo4J graph. The file has to be located in Neo4JImportPath

Parameters

importname (str) – The name for the file to be imported, which must be under ./Neo4JImportPath/

Returns

A Neo4J connexion to the database that imports the file, using batch optimizations and smaller batch sizes to try to keep the impact on memory use low

Return type

Note

For this to work, you HAVE TO have APOC available on your Neo4J installation

kill_neo4j(neo4j_home='neo4j')[source]#

A simple function that kills any process that was started using a cmd argument including “neo4j”

Parameters: neo4j_home (str) – the installation directory for the neo4j program; by default, neo4j

Warning

This function may unintentionally kill any command run from the neo4j folder. This is unfortunate, but the creation of this function was essential given that neo4j stop does not work properly; instead of dying, the process lingers on, interfering with find_neo4j_installation_status and hindering the main program

manage_transaction(tx, driver, num_retries=10, neo4j_home='neo4j', **kwargs)[source]#

A function that repeats transactions whenever an error is found. This may make an incorrect script unnecessarily repeat; however, since the error is printed, one can discriminate those out, and the function remains helpful to prevent SPARQL Read Time-Outs.

It will also re-start neo4j in case it randomly dies while executing a query.

Parameters

tx (str) – The transaction that we desire to run, specified as a CYPHER query
driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
num_retries (int) – The number of times that we wish the transaction to be retried
neo4j_home (str) – the installation directory for the neo4j program; by default, neo4j
**kwargs – Any number of arbitrary keyword arguments

Raises

Exception – An exception telling the user that the maximum number of retries has been exceeded, if such a thing happens

Returns

The response from the Neo4J Database

Return type

Note

This function does not accept args, but only kwargs (named keyword arguments). Thus, if you wish to add a parameter (say, number), you should add it as: number=33
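
The retry-with-kwargs pattern described above can be sketched generically. In this illustration the thing being retried is any Python callable rather than a CYPHER query run through a driver, and Neo4J is not restarted; run_with_retries and wait_seconds are hypothetical names:

```python
import time


def run_with_retries(action, num_retries=10, wait_seconds=0, **kwargs):
    """Call action(**kwargs), retrying on any error up to num_retries times."""
    for attempt in range(1, num_retries + 1):
        try:
            return action(**kwargs)
        except Exception as error:
            # Printing the error lets the user tell a genuinely wrong
            # query apart from a transient time-out.
            print(f"Attempt {attempt} of {num_retries} failed: {error}")
            time.sleep(wait_seconds)
    raise Exception(f"Maximum number of retries ({num_retries}) exceeded")
```

Extra parameters travel as keyword arguments only, e.g. run_with_retries(some_function, number=33), which mirrors the kwargs-only rule in the note.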

merge_duplicate_nodes(node_types, node_property, optional_condition='', more_props='')[source]#

Merges any two nodes of the given `node_types` that share the same `node_property`.

Parameters

node_types (str) – The labels of the nodes that will be selected for merging; i.e. n:Fruit OR n:Vegetable
node_property (str) – The node properties used for collecting, if not using all properties.
optional_condition (str) – An optional Neo4J Statement, starting with “AND”, to be added after the WHERE clause.

Returns

A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.

Return type

neo4j.Result

Warning

When using, take good care with how the key names are written: sometimes, if a key is not present, all nodes will be merged!
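
To make the warning concrete: if nodes are grouped by a property that none of them has, they all share the value null and end up in a single group. A sketch of a query builder that guards against this with an IS NOT NULL check is shown below. It assumes APOC's apoc.refactor.mergeNodes procedure is available; the function name and the exact query are illustrative, not the ones used in CanGraph:

```python
def build_merge_query(node_types, node_property, optional_condition=""):
    """Build a CYPHER query that merges nodes sharing one property value."""
    # The IS NOT NULL guard keeps nodes that lack the property from being
    # grouped together under a shared null value.
    return f"""
        MATCH (n)
        WHERE ({node_types}) AND n.{node_property} IS NOT NULL {optional_condition}
        WITH n.{node_property} AS shared_value, collect(n) AS nodes
        WHERE size(nodes) > 1
        CALL apoc.refactor.mergeNodes(nodes, {{properties: "combine", mergeRels: true}})
        YIELD node
        RETURN count(node) AS merged
        """
```

For example, build_merge_query("n:Fruit OR n:Vegetable", "Name") produces a query that only merges Fruit or Vegetable nodes that actually carry a Name.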

old_sleep_with_counter(seconds, step=20, message='Waiting...')[source]#

A function that waits while showing a cute animation, but without using the alive_progress module

Note

This function interacts weirdly with slurm; I’d recommend not using it on the HPC

Parameters

seconds (int) – The number of seconds that we would like the program to wait for
step (int) – The number of times the counter wheel will turn in a second; by default, 20
message (str) – An optional, text message to add to the waiting period

purge_database(driver, method=['merge', 'delete'])[source]#

A series of commands that purge a database, removing unnecessary, duplicated or empty nodes and merging those without required properties. This has been converted into a common function to standardize the ways the nodes are merged.

Parameters

driver (neo4j.Driver) – Neo4J’s Bolt Driver currently in use
method (list) – The part of the function that we want to execute; if [“delete”], only call queries that delete nodes; if [“merge”], only call those that merge; if both, do both

Returns: This function modifies the Neo4J Database as desired, but does not produce any particular return.

Warning

When modifying, take good care with how the key names are written: with merge_duplicate_nodes, sometimes, if a key is not present, all nodes will be merged!

remove_ExternalEquivalent()[source]#

Removes all nodes of type ExternalEquivalent from the DataBase; since these do not add new info, one might consider them not useful.

Parameters: tx (neo4j.Session) – The session under which the driver is running
Returns: A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
Return type: neo4j.Result

remove_duplicate_relationships()[source]#

Removes duplicated relationships between ANY existing pair of nodes.

Parameters: tx (neo4j.Session) – The session under which the driver is running
Returns: A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
Return type: neo4j.Result

Note

Only deletes DIRECTED relationships between THE SAME nodes, combining their properties

See also

This way of working has been taken from StackOverflow #18724939

remove_n10s_graphconfig()[source]#

Removes the “_GraphConfig” node, which is necessary for querying SPARQL endpoints but not at all useful in our final export

Parameters: tx (neo4j.Session) – The session under which the driver is running
Returns: A Neo4J connexion to the database that modifies it according to the CYPHER statement contained in the function.
Return type: neo4j.Result

Deprecated since version 0.9: Since we are importing based on apoc.load.jsonParams, this is not needed anymore

restart_neo4j(neo4j_home='neo4j')[source]#

A simple function that (re)starts a neo4j server and returns its bolt address

Parameters: neo4j_home (str) – the installation directory for the neo4j program; by default, neo4j

Note

Re-starting is better than starting, as it tries to kill old sessions (a task at which it fails miserably, thus the need for kill_neo4j), and, most importantly, because it returns the currently used bolt port

scan_folder(folder_path)[source]#

Scans a folder and finds all the files present in it

Parameters: folder_path (str) – The folder that is to be scanned
Returns: A list of all the files in the folder, listed by their absolute path
Return type: list
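
A minimal sketch of such a scan, assuming that sub-folders are traversed too and that the result is sorted (neither is stated above); the name list_files is hypothetical:

```python
import os


def list_files(folder_path):
    """Return the absolute paths of all files under folder_path."""
    found = []
    for root, _dirs, files in os.walk(folder_path):
        for name in files:
            found.append(os.path.abspath(os.path.join(root, name)))
    # Sorting is an addition of this sketch, to get a stable order.
    return sorted(found)
```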

sleep_with_counter(seconds, step=20, message='Waiting...')[source]#

A function that waits while showing a cute animation

Parameters

seconds (int) – The number of seconds that we would like the program to wait for
step (int) – The number of times the counter wheel will turn in a second; by default, 20
message (str) – An optional, text message to add to the waiting period

split_csv(filename, folder, sep=',', sep_out=',', startFrom=0, withStepsOf=1)[source]#

Splits a given .csv/tsv file into n smaller csv files, one for each row of the original file, so that it does not crash when processing it. It also allows reading to start from line `startFrom`.

Parameters

filepath (str) – The path to the file that needs to be split
splittag (str) – The tag based on which the file will be split
bigtag (str) – The main tag of the file, which needs to be re-added.

Returns

The number of files that have been produced from the original

Return type

int

Warning

The original file will be removed
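
The one-file-per-row idea can be sketched as follows. Several details are assumptions made for this example: that filename is located inside folder, that the first row is a header to be copied into every output file, and the output naming scheme. The startFrom, withStepsOf and sep_out options are not covered:

```python
import csv
import os


def split_csv_rows(filename, folder, sep=","):
    """Write one small CSV per data row and remove the original file."""
    source_path = os.path.join(folder, filename)
    with open(source_path, newline="", encoding="utf-8") as source:
        rows = list(csv.reader(source, delimiter=sep))
    if not rows:
        return 0  # nothing to split; the empty file is left in place
    header, records = rows[0], rows[1:]
    stem = os.path.splitext(filename)[0]
    for index, record in enumerate(records):
        target = os.path.join(folder, f"{stem}_{index}.csv")
        with open(target, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out, delimiter=sep)
            writer.writerow(header)  # repeat the header in every file
            writer.writerow(record)
    os.remove(source_path)  # matches the warning above
    return len(records)
```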

split_xml(filepath, splittag, bigtag)[source]#

Splits a given .xml file into n smaller XML files, one for each splittag section that is present in the original file, which should be of type bigtag. For example, we might have an <hmdb> file which we want to split based on the <metabolite> items therein contained. This is so that Neo4J does not crash when processing it.

Parameters

filepath (str) – The path to the file that needs to be split
splittag (str) – The tag based on which the file will be split
bigtag (str) – The main tag of the file, which needs to be re-added.

Returns

The number of files that have been produced from the original

Return type

int

Warning

The original file will be removed

untargz(file_path, folder)[source]#

Extracts a tar.gz file present at a given file_path into a given folder

Parameters

file_path (str) – The path to the tar.gz file to be extracted
folder (str) – The folder under which the file will be stored.

Returns

The path where the file we just extracted has been stored

Return type

str

unzip(file_path, folder)[source]#

Unzips a file present at a given file_path into a given folder

Parameters

file_path (str) – The path to the Zipfile to be unzipped
folder (str) – The folder under which the file will be stored.

Returns

The path where the file we just unzipped has been stored

Return type

str
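
A minimal sketch of this kind of extraction using the standard library's zipfile module. Returning the destination folder is an assumption about what "the path where the file has been stored" means, and unzip_sketch is a hypothetical name:

```python
import os
import zipfile


def unzip_sketch(file_path, folder):
    """Extract the zip archive at file_path into folder."""
    # Create the destination folder if it does not exist yet.
    os.makedirs(folder, exist_ok=True)
    with zipfile.ZipFile(file_path) as archive:
        archive.extractall(folder)
    return folder
```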

CanGraph.setup module#

A python module that prepares the local environment, to be able to run the main and deploy functions. This can be run either in an interactive way, requiring user input, or in an automatic way, in order to pre-configure things, for example, if you are using the singularity package

CanGraph.setup Usage#

To use this module:

A python module that prepares the local environment, to be able to run the CanGraph.main and CanGraph.deploy functions.

usage: python3 setup.py [-h] [-i] [-a] [--dbfolder [DBFOLDER]] [--git [GIT]]
                        [--requirements [REQUIREMENTS]] [-n [NEO4J]]
                        [--neo4j_username [NEO4J_USERNAME]]
                        [--neo4j_password [NEO4J_PASSWORD]]

Named Arguments#

-i, --interactive: tells the script if it wants interaction from the user and information shown to them; similar to --verbose
-a, --all: runs all the options below at once; equivalent to -dgnr; it DOES NOT activate the interactive mode
--dbfolder: set up the databases from which the program will pull its info using the provided folder
--git: prepare the git environment for the deploy script using the provided git folder
--requirements: installs all the requirements needed for all the possible options from the given requirements file
-n, --neo4j: set up the neo4j local environment, to run from the provided folder
--neo4j_username: the username for the neo4j database
--neo4j_password: the password for the neo4j database

CanGraph.setup Functions#

This module is comprised of:

args_parser()[source]#

Parses the command line arguments into a more usable form, providing help and more

Returns: A dictionary of the different possible options for the program as keys, specifying their set value. If no command-line arguments are provided, the help message is shown and the program exits.
Return type: argparse.ArgumentParser

Note

Note that, in Google Docstrings, if you want a multi-line Returns comment, you have to start it in a different line :(

Note

The return must be of type argparse.ArgumentParser for the argparse directive to work and auto-gen docs

Note

The --all option has to be addressed outside of this function so as not to mess up the argparse directive in sphinx

change_neo4j_password(new_password, old_password='neo4j', user='neo4j', database='system', neo4j_home='neo4j')[source]#

Changes the neo4j password for user user, from old_password to new_password, by using a simple query in cypher-shell

Parameters

neo4j_home (str) – the installation directory for the neo4j program; by default, neo4j
new_password (str) – the new password for the database
old_password (str) – the old password for the database, needed for identification.
user (str) – the user for which the password is being changed.
database (str) – the name of the database for which we want to modify the password. By default, it is system, since Neo4J’s community edition only allows for one database

Warning

DO NOT REMOVE THE TRY-EXCEPT BLOCK THAT ATTEMPTS TO CONNECT TO NEO4J: It somehow magically makes the password change work. IT WILL NOT WORK IF THOSE LINES ARE NOT PRESENT
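
For orientation, a password change through cypher-shell might be assembled as in the sketch below. The helper name build_password_change_command is hypothetical, the location of cypher-shell under neo4j_home/bin is an assumption about the installation layout, and the sketch deliberately leaves out the try-except connection attempt mentioned in the warning above.

```python
import os


def build_password_change_command(new_password, old_password="neo4j", user="neo4j",
                                  database="system", neo4j_home="neo4j"):
    """Return the cypher-shell argument list for a password change (illustrative)."""
    # Naive string formatting: passwords containing quotes would need escaping
    query = f"ALTER CURRENT USER SET PASSWORD FROM '{old_password}' TO '{new_password}';"
    return [
        os.path.join(neo4j_home, "bin", "cypher-shell"),  # assumed install layout
        "-u", user,
        "-p", old_password,
        "-d", database,
        query,
    ]


command = build_password_change_command("s3cret")
# On a machine with a running Neo4j instance, the list could then be passed to
# subprocess.run(command, check=True)
```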

check_exposome_files(databasefolder='./DataBases')[source]#

Checks for the presence of all the files that should be in “databasefolder/ExposomeExplorer” for the ExposomeExplorer part of the script to run

Parameters: databasefolder (str) – The main folder where all the databases we will be using are to be found
Returns: One of [“Splitted”, “UnSplitted”, “Error”]. If “Error”, Exposome-Explorer should not be used as a data source; if “UnSplitted”, please split the “components” file.
Return type: str

configure_neo4j(neo4j_home='neo4j')[source]#

Modifies the Neo4J conf file according to the recommendations provided by memrec, neo4j’s memory recommender. It also enables the Awesome Procedures On Cypher (APOC) plugin from Neo4j Labs, and sets other basic options such as file export and import, or longer timeouts

Parameters: neo4j_home (str) – the installation directory for the neo4j program; by default, neo4j

Note

In order to make the setup more consistent, this function also forces the Neo4JImportPath (dbms.directories.import) to be written as an absolute path, instead of being relative to neo4j_home
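
The absolute-path rewrite described in this note could look roughly like the sketch below. It works on the text of the conf file rather than on the file itself; the name absolutize_import_dir is hypothetical and this is not the function's actual code.

```python
import os
import re


def absolutize_import_dir(conf_text, neo4j_home="neo4j"):
    """Rewrite dbms.directories.import so that it holds an absolute path."""
    def replace(match):
        value = match.group(2).strip()
        if not os.path.isabs(value):
            # Relative values are resolved against neo4j_home
            value = os.path.abspath(os.path.join(neo4j_home, value))
        return match.group(1) + value

    # Only touches an uncommented "dbms.directories.import=..." line
    pattern = r"^(dbms\.directories\.import=)(.*)$"
    return re.sub(pattern, replace, conf_text, flags=re.MULTILINE)
```

All other settings in the conf text are returned unchanged, and a value that is already absolute is left as it is.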

final_message(interactive=False)[source]#

Prompts the user with a final message.

Parameters: interactive (bool) – Whether the session is set to be interactive or not

find_neo4j_installation_status(neo4j_home='neo4j', neo4j_username='neo4j', neo4j_password='neo4j')[source]#

Finds the installation status of Neo4J by trying to use it normally, and analyzing any thrown exceptions

Parameters

neo4j_home (str) – the installation directory for the neo4j program; by default, neo4j
neo4j_username (str) – the username for the neo4j database; by default neo4j
neo4j_password (str) – the password for the neo4j database; by default neo4j

Returns

A list of two booleans: whether neo4j exists at neo4j_home, and whether the supplied credentials are valid or not

Return type

initial_message()[source]#

Prompts the user with an initial message if the session is set to be interactive.

Parameters: interactive (bool) – Whether the session is set to be interactive or not

install_neo4j(neo4j_home='neo4j', interactive=False, version='4.4.0')[source]#

Installs the neo4j database program in the neo4j_home folder, by getting it from the internet according to the Operating System the script is being run on (aims for multi-platform!)

Parameters

neo4j_home (str) – the installation directory for the neo4j program; by default, neo4j
interactive (bool) – tells the script if it wants interaction from the user and information shown to them
version (str) – the version of the neo4j software that we wish to install
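
Choosing a download according to the Operating System might be sketched as follows. The archive naming pattern (unix.tar.gz versus windows.zip) is an assumption about how Neo4j community releases are packaged, and neo4j_archive_name is a hypothetical helper, not part of CanGraph.

```python
import platform


def neo4j_archive_name(version="4.4.0", system=None):
    """Pick the release archive that fits the current (or given) operating system."""
    system = system or platform.system()
    if system == "Windows":
        # Assumed naming pattern for the Windows build
        return f"neo4j-community-{version}-windows.zip"
    # Linux and macOS are assumed to share the same "unix" tarball
    return f"neo4j-community-{version}-unix.tar.gz"
```

A zip archive would then be extracted with a helper such as unzip, and a tarball with its tar.gz counterpart.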

install_packages(requirements_file=None, package_name=None, interactive=False)[source]#

Automates installing packages using PIP

Parameters

requirements_file (str) – The path to a “requirements.txt” file, containing one requirement per line
package_name (str) – A package to be installed
interactive (bool) – Whether the session is set to be interactive or not

Raises

ValueError – If neither a requirements_file nor a package_name is provided
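
A minimal sketch of this behaviour, assuming pip is invoked through the running interpreter, is shown below. The names build_pip_command and install_packages_sketch are hypothetical, and the interactive flag is omitted for brevity.

```python
import subprocess
import sys


def build_pip_command(requirements_file=None, package_name=None):
    """Assemble the pip invocation, or raise if there is nothing to install."""
    if requirements_file is None and package_name is None:
        raise ValueError("Provide a requirements_file or a package_name")
    # sys.executable keeps the install inside the interpreter that runs the script
    command = [sys.executable, "-m", "pip", "install"]
    if requirements_file is not None:
        command += ["-r", requirements_file]
    if package_name is not None:
        command.append(package_name)
    return command


def install_packages_sketch(requirements_file=None, package_name=None):
    """Run pip with the assembled command; raises CalledProcessError on failure."""
    subprocess.run(build_pip_command(requirements_file, package_name), check=True)
```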

main()[source]#: The function that executes the code

setup_database_index(databasefolder='./DataBases')[source]#

Prepares the index file for all the databases present in the databasefolder folder, which should greatly reduce processing time

Parameters: databasefolder (str) – The main folder where all the databases we will be using are to be found
Returns: A dictionary containing the index for all the databases in databasefolder. This index will be written as JSON in databasefolder/index.json
Return type: dict
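
One way such an index could be built and saved is sketched below. The index layout used here (each database subfolder mapped to its sorted file names) is purely an assumption for illustration, as is the name build_index_sketch; the real index may store different information.

```python
import json
import os


def build_index_sketch(databasefolder="./DataBases"):
    """Map each database subfolder to its files and save the result as index.json."""
    index = {}
    for entry in sorted(os.listdir(databasefolder)):
        subfolder = os.path.join(databasefolder, entry)
        if os.path.isdir(subfolder):
            index[entry] = sorted(os.listdir(subfolder))
    # The index is written next to the databases, as databasefolder/index.json
    with open(os.path.join(databasefolder, "index.json"), "w") as handle:
        json.dump(index, handle, indent=2)
    return index
```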

setup_databases(databasefolder='./DataBases', interactive=False)[source]#

Sets up the databasefolder from which the main script will take its data. It does so by creating (or removing and re-creating) the databasefolder, and then either putting the necessary files inside it, or asking/checking whether the user has done so

Parameters

databasefolder (str) – The main folder where all the databases we will be using are to be found
interactive (bool) – Whether the session is set to be interactive or not

setup_drugbank(databasefolder='./DataBases', interactive=False)[source]#

Sets up the files relative to the DrugBank database in the databasefolder, splitting them for easier processing later on.

Parameters

databasefolder (str) – The main folder where all the databases we will be using are to be found
interactive (bool) – Whether the session is set to be interactive or not

Returns

True if everything went okay; False otherwise. If False, DrugBank should not be used as a data source

Return type

Warning

When updating the DrugBank DataBase Version, please edit this function to reflect the correct number of files

setup_exposome(databasefolder='./DataBases', interactive=False)[source]#

Sets up the files relative to the Exposome Explorer database in the databasefolder, splitting them for easier processing later on. If the session is set to be interactive, the user will be given time to add the files themselves; if not, the full suite of necessary files will be checked for their presence in databasefolder

Then, the “components” file will be split into one record per line, as main requires

Parameters

databasefolder (str) – The main folder where all the databases we will be using are to be found
interactive (bool) – Whether the session is set to be interactive or not

Returns

True if everything went okay; False otherwise. If False, Exposome-Explorer should not be used as a data source

Return type

Warning

When updating the Exposome Explorer DataBase Version, please edit check_exposome_files to reflect the correct number of files

setup_folders(databasefolder='./DataBases', interactive=False)[source]#

Creates the databasefolder if it does not exist. If it does, it either asks before overwriting in interactive mode, or directly overwrites in auto mode.

Parameters

databasefolder (str) – The main folder where all the databases we will be using are to be found
interactive (bool) – Whether the session is set to be interactive or not

Raises

ValueError – If the Databases folder already exists (so as not to overwrite)

Returns

True if successful, False otherwise.

Return type

setup_git(path_to_repo='.git')[source]#

Sets up the Git environment for the deploy script. It does so by removing any existing remotes and setting two new ones: github and codeberg, with their respective branches

Parameters: path_to_repo (str) – The path to the Git repo; by default, .git

setup_hmdb(databasefolder='./DataBases')[source]#

Sets up the files relative to the HMDB database in the databasefolder, splitting them for easier processing later on.

Parameters: databasefolder (str) – The main folder where all the databases we will be using are to be found
Returns: True if everything went okay; False otherwise. If False, HMDB should not be used as a data source
Return type: bool

Warning

When updating the Exposome Explorer DataBase Version, please edit check_exposome_files to reflect the correct number of files

setup_neo4j(neo4j_home='neo4j', neo4j_username='neo4j', neo4j_password='neo4j', interactive=False)[source]#

Sets up the neo4j environment in neo4j_home, so that the functions in main can function properly. Using the functions present in this module, it finds out whether neo4j is installed with default credentials and, if not, installs it, changing the default password to a new one and returning its value

Parameters

neo4j_home (str) – the installation directory for the neo4j program; by default, neo4j
neo4j_username (str) – the username for the neo4j database; by default neo4j
neo4j_password (str) – the password for the neo4j database; by default neo4j
interactive (bool) – tells the script if it wants interaction from the user and information shown to them

Returns

The password that was set up for the new neo4j database. This is also written to .neo4jpassword

Return type