
This library safely handles the decoding and encoding of SBML identifiers and SBML entity names.


# Installation

From PyPI:

pip install pyliss-id-conv
pip uninstall pyliss-id-conv

From git repository:

make install
make uninstall

.. note:: The module will be installed in the current Python environment
(Python2 or Python3 depending on your distribution or your virtual environment).


# Identifiers

## Universal functions


- `universal_decoder(identifier, source_database=None)`

<br>

:param arg1: encoded id
:param arg2: source database
:type arg1: <str>
:type arg2: <str>
:return: tuple with decoded id and compartment.
Please note that compartment could be None.
:rtype: <tuple <str>, <str>>



Decode an identifier according to the given source database (`metacyc`, `bigg` or `None`).

See `decode_bigg()`, `decode_metacyc()` and `decode_unknown()` for further information.

.. note:: If the returned identifier is longer than 80
characters, it is replaced by its MD5 hex digest.

Example:

:::python
>>> universal_decoder('M__45__CRESOL__45__METHYLCATECHOL__45__RXN')
('M-CRESOL-METHYLCATECHOL-RXN', None)
>>> universal_decoder('M_tartr__L_e')
('tartr__L', 'e')
>>> universal_decoder('_4M__45__TOTO_c', 'bigg')
('4M-TOTO', 'c')
>>> universal_decoder('S__40_15S_41__45_15_45_Hydroxy_45_5_44_8_44_11_45_cis_45_13_45_trans_45_eicosatetraenoate_c')
('d461e03952d4e7efbd538cc9363691e9', 'c')
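
The length rule from the note above can be pictured with the standard `hashlib` module. This is only a sketch: the helper name `shorten_identifier`, the `max_length` parameter and the UTF-8 encoding of the input are assumptions made for illustration, not the library's confirmed implementation.

```python
import hashlib


def shorten_identifier(identifier, max_length=80):
    """Return the identifier, or its MD5 hex digest if it is too long.

    Hypothetical helper: the name, the parameter and the UTF-8 encoding
    are assumptions, not part of the documented API.
    """
    if len(identifier) > max_length:
        # An MD5 hex digest is always 32 characters long.
        return hashlib.md5(identifier.encode('utf-8')).hexdigest()
    return identifier


print(shorten_identifier('4M-TOTO'))        # short ids are returned unchanged
print(len(shorten_identifier('x' * 81)))    # long ids become a fixed-size digest
```

Hashing keeps long identifiers at a fixed size, at the cost of no longer being human-readable.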


---

- `universal_encoder(identifier, source_database=None, **kwargs)`

<br>

:param arg1: decoded id
:param arg2: source database
:param metabolite: (keyword) True (default) if id is a metabolite; False if it is a reaction
:param compartment: (keyword) compartment info to be concatenated (optional)
:type arg1: <str>
:type arg2: <str>
:type metabolite: <boolean>
:type compartment: <str>
:return: encoded id
:rtype: <str>

Encode an identifier according to the given source database (`metacyc`, `bigg` or `None`).

See `encode_bigg()`, `encode_metacyc()` and `encode_unknown()` for further information.

Example:

:::python
>>> universal_encoder('4M-TOTO_c')
'_4M__45__TOTO_c'
>>> universal_encoder('4M-TOTO_c', 'metacyc')
'_4M__45__TOTO_c'
>>> universal_encoder('|M-TOTO', source_database="metacyc", compartment='c')
'__124__M__45__TOTO_c'
>>> universal_encoder('|M-TOTO', source_database=None, compartment='c')
'__124__M__45__TOTO_c'
>>> universal_encoder('12PPDt', source_database="bigg", metabolite=False, compartment='c')
'R_12PPDt_c'
>>> universal_encoder('12PPDt', source_database=None, metabolite=False, compartment='c')
'_12PPDt_c'


## Functions dedicated to Metacyc


- `decode_metacyc(identifier)`

<br>

:param arg1: encoded id
:type arg1: <str>
:return: tuple with decoded id and compartment.
Please note that compartment could be None.
:rtype: <tuple <str>, <str>>

Clean up dirty Metacyc identifiers from SBML.<br>
* convert numerically encoded characters back to UTF-8 characters<br>
* strip ALL (!) `_` at the beginning<br>

Example:

:::python
>>> decode_metacyc('_4M__45__TOTO_c')
('4M-TOTO', 'c')
>>> decode_metacyc('_9__45__cis__45__Epoxycarotenoids')
('9-cis-Epoxycarotenoids', None)
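
The two cleaning steps above can be sketched with the standard `re` module (Python 3). The function name `sketch_decode_metacyc` and both regular expressions are assumptions inferred from the examples: special characters are taken to be written as `__<decimal code point>__`, and a compartment is taken to be a trailing `_x` where `x` is a single letter. The library's real implementation may differ.

```python
import re

# Assumption: special characters are encoded as __<decimal code point>__.
ENCODED_CHAR = re.compile(r'__(\d+)__')
# Assumption: a compartment is a trailing "_x", x being a single letter.
COMPARTMENT = re.compile(r'^(.+)_([A-Za-z])$')


def sketch_decode_metacyc(identifier):
    """Hypothetical re-implementation, for illustration only."""
    # Decode first, so that a leading encoded character such as
    # __124__ is not damaged by the underscore stripping below.
    decoded = ENCODED_CHAR.sub(lambda m: chr(int(m.group(1))), identifier)
    # Strip all leading underscores.
    decoded = decoded.lstrip('_')
    match = COMPARTMENT.match(decoded)
    if match:
        return match.group(1), match.group(2)
    return decoded, None


print(sketch_decode_metacyc('_4M__45__TOTO_c'))
print(sketch_decode_metacyc('_9__45__cis__45__Epoxycarotenoids'))
```

On the two documented inputs this sketch returns the same tuples as the examples above.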

---

- `encode_metacyc(identifier, compartment=None, **kwargs)`

<br>

:param arg1: decoded id
:param arg2: compartment info to be concatenated (optional)
:type arg1: <str>
:type arg2: <str>
:return: encoded id
:rtype: <str>

Encode an identifier to the Metacyc SBML format.<br>
* encode each non-word character as its numeric code between double underscores (`-` becomes `__45__`)<br>
* add prefix `_` if the first character is a digit<br>
* add suffix `_compartment` if the given parameter is not None<br>

Example:

:::python
>>> encode_metacyc('4M-TOTO_c')
'_4M__45__TOTO_c'
>>> encode_metacyc('|M-TOTO', compartment='c')
'__124__M__45__TOTO_c'
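
The three encoding rules above can be sketched as follows. The name `sketch_encode_metacyc` is hypothetical, and the use of the regex class `\W` for "non-word character" is an assumption inferred from the examples; this is not the library's confirmed implementation.

```python
import re


def sketch_encode_metacyc(identifier, compartment=None):
    """Hypothetical re-implementation, for illustration only."""
    # Replace each non-word character by __<decimal code point>__
    # (assumption: "non-word" means the regex class \W).
    encoded = re.sub(r'\W', lambda m: '__%d__' % ord(m.group()), identifier)
    # Prefix with "_" when the first character is a digit.
    if encoded[:1].isdigit():
        encoded = '_' + encoded
    # Append the compartment when one is given.
    if compartment is not None:
        encoded += '_' + compartment
    return encoded


print(sketch_encode_metacyc('4M-TOTO_c'))
print(sketch_encode_metacyc('|M-TOTO', compartment='c'))
```

On the two documented inputs this sketch produces the same strings as the examples above.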


## Functions dedicated to BiGG

- `decode_bigg(identifier)`

<br>

:param arg1: encoded id
:type arg1: <str>
:return: tuple of decoded id and compartment, or None in case of
failure (bad prefix). Please note that compartment could be None.
:rtype: <tuple <str>, <str>> or None

Clean up dirty BiGG identifiers from SBML.<br>

In summary: `identifier.lstrip('M_').rstrip('_e').rstrip('_b').rstrip('_c').replace('DASH', '')`<br>
* Remove `__DASH__` pattern<br>
* Remove `M_` or `R_` prefix<br>
* Remove `_x` suffix of compartment info<br>

.. warning:: We assume that any compartment is composed of:<br>
* 1 character **ONLY**<br>
* the unique character **IS NOT** a digit<br>

Example:

:::python
>>> decode_bigg('R_EX_12ppd__S_e')
('EX_12ppd__S', 'e')
>>> decode_bigg('R_MAN6Pt6_2')
('MAN6Pt6_2', None)
>>> decode_bigg('_4M__45__TOTO_c')
None
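
The prefix and compartment rules can be sketched with a single regular expression. A regex is used here because `str.lstrip('M_')` strips any leading run of the characters `M` and `_` rather than the literal prefix (for instance `'M_MAN'.lstrip('M_')` gives `'AN'`), so the one-line summary above should be read as shorthand only. The name `sketch_decode_bigg` and the pattern are assumptions; compartments are restricted to a single ASCII letter here, and `__DASH__` handling is left out.

```python
import re

# Assumptions: the prefix is "M_" or "R_", and the optional compartment
# is a trailing "_x" where x is one ASCII letter (never a digit).
BIGG_ID = re.compile(r'^[MR]_(.+?)(?:_([A-Za-z]))?$')


def sketch_decode_bigg(identifier):
    """Hypothetical re-implementation, for illustration only."""
    match = BIGG_ID.match(identifier)
    if match is None:
        # Bad prefix: signal the failure with None.
        return None
    return match.group(1), match.group(2)


print(sketch_decode_bigg('R_EX_12ppd__S_e'))
print(sketch_decode_bigg('R_MAN6Pt6_2'))
print(sketch_decode_bigg('_4M__45__TOTO_c'))
```

The trailing `_2` of `R_MAN6Pt6_2` is kept in the identifier because a digit is not accepted as a compartment.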

---

- `encode_bigg(identifier, metabolite=True, compartment=None, **kwargs)`

<br>

:param arg1: decoded id
:param arg2: True (default) if id is a metabolite; False if it is a reaction
:param arg3: compartment info to be concatenated (optional)
:type arg1: <str>
:type arg2: <boolean>
:type arg3: <str>
:return: encoded id
:rtype: <str>

Encode an identifier to the BiGG SBML format.<br>
* encode each non-word character as its numeric code<br>
* add prefix `M_` in case of metabolite=True (default)<br>
* add prefix `R_` in case of metabolite=False<br>
* add suffix `_compartment` if the given parameter is not None<br>

Example:

:::python
>>> encode_bigg('12PPDt', metabolite=False, compartment='c')
'R_12PPDt_c'
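
A minimal sketch of the rules above, checked only against the single documented example. The name `sketch_encode_bigg` is hypothetical, and the `__<decimal code point>__` form used for non-word characters is an assumption borrowed from the Metacyc examples; the library may encode them differently for BiGG.

```python
import re


def sketch_encode_bigg(identifier, metabolite=True, compartment=None):
    """Hypothetical re-implementation, for illustration only."""
    # Assumption: non-word characters use the same __<code>__ form
    # as in the Metacyc examples.
    encoded = re.sub(r'\W', lambda m: '__%d__' % ord(m.group()), identifier)
    # "M_" marks a metabolite, "R_" a reaction.
    encoded = ('M_' if metabolite else 'R_') + encoded
    # Append the compartment when one is given.
    if compartment is not None:
        encoded += '_' + compartment
    return encoded


print(sketch_encode_bigg('12PPDt', metabolite=False, compartment='c'))
```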



# Names


This module includes some functions used to clean names from Metacyc raw files.

Documentation & examples about html entities:

- https://docs.python.org/3/library/html.html#html.unescape
- https://docs.python.org/3/library/html.entities.html
- https://alexandre.alapetite.fr/doc-alex/alx_special.html

Examples of the same name in raw exports from different sources:

- Metacyc: `a β-D-galactosyl-(1,4)-N-acetyl-β-D-glucosaminyl-(1-3)-β-D-galactosyl-1,4-β-D-glucosyl-(1↔1)-ceramide`
- Metacyc database: `a &beta;-D-galactosyl-(1,4)-N-acetyl-&beta;-D-glucosaminyl-(1-3)-&beta;-D-galactosyl-1,4-&beta;-D-glucosyl-(1&harr;1)-ceramide`
- chebi: `β-D-galactosyl-(1→4)-N-acetyl-β-D-galactosaminyl-(1→3)-β-D-galactosyl-(1→4)-β-D-glucosylceramide`
- chebi ascii: `beta-D-galactosyl-(1->4)-N-acetyl-beta-D-galactosaminyl-(1->3)-beta-D-galactosyl-(1->4)-beta-D-glucosylceramide`

Entities found in Metacyc dump:

{'&pi;', '&alpha;', '&Delta;', '&mu;', '&chi;', '&plusmn;', '&tau;',
'&DElta;', '&omega;', '&zeta;', '&gamma;', '&psi;', '&harr;', '&kappa;',
'&lambda;', '&beta;', '&iota;', '&xi;', '&epsilon;', '&rarr;', '&nu;',
'&delta;'}

Specific conversions in chemistry context:

- &pi => π => pros
  CPD-1823, Nπ-methyl-L-histidine.
  The nitrogen atoms of the imidazole ring of histidine are denoted by pros
  ('near', abbreviated π) and tele ('far', abbreviated τ)
  to show their position relative to the side chain.
  http://goldbook.iupac.org/P04890.html
- &tau => τ => tele
  N-METHYL-HISTAMINE, Nτ-methylhistamine
- &harr => ↔ => <->
- &rarr => → => ->
- &plusmn => ± => +-
  CPD-16445, (±)-pavine


HTML entities found in SBML dump of Metacyc:

{'&quot;', '&gt;', '&amp;', '&apos;'}

*False HTML entities* (encoded twice) found in SBML dump of Metacyc:

{'&amp;iota;', '&amp;lambda;', '&amp;gamma;', '&amp;omega;', '&amp;pi;',
'&amp;prime;', '&amp;mu;', '&amp;plusmn;', '&amp;tau;', '&amp;chi;',
'&amp;delta;', '&amp;harr;', '&amp;Delta;', '&amp;kappa;', '&amp;alpha;',
'&amp;beta;', '&amp;epsilon;', '&amp;zeta;', '&amp;rarr;', '&amp;psi;',
'&amp;mdash;', '&amp;nu;', '&amp;xi;'}


Commands used on SBML dump from Metacyc to find HTML entities:

- cat metacyc_18.5.xml | egrep -o --color -e '&\w+;' >> metacyc_html_entities.txt
- cat metacyc_18.5.xml | egrep -o --color -e '&\w+;\w+;' >> metacyc_html_false_entities.txt
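
The same two patterns can also be applied from Python with the standard `re` module. This is a sketch only: the function name `find_entities` and the sample string are made up for illustration.

```python
import re

# Same patterns as in the egrep commands above.
SINGLE_ENTITY = re.compile(r'&\w+;')
DOUBLE_ENTITY = re.compile(r'&\w+;\w+;')


def find_entities(text):
    """Return (entities, doubly encoded entities) found in text, as sets."""
    return set(SINGLE_ENTITY.findall(text)), set(DOUBLE_ENTITY.findall(text))


# Hypothetical sample mixing a normal and a doubly encoded entity.
sample = 'a &beta;-D-galactosyl and galactosaminyl-&amp;alpha;1,3-'
entities, false_entities = find_entities(sample)
print(sorted(entities))
print(sorted(false_entities))
```

Note that the first pattern also reports the `&amp;` part of a doubly encoded entity, which is why both files are needed to tell the two cases apart.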


Example of function used to process the files generated above:

:::python
def test_raw_files_and_functions():
    r"""Display the HTML entities found in the raw files.

    .. warning:: All these entities have to be processed before
        any import into the database.

    .. note:: Shell commands used on the SBML dump
        from Metacyc, to generate the problematic files:

        - cat metacyc_18.5.xml | egrep -o --color -e '&\w+;' >> metacyc_html_entities.txt
        - cat metacyc_18.5.xml | egrep -o --color -e '&\w+;\w+;' >> metacyc_html_false_entities.txt
    """
    # `cm` is not defined in this excerpt; it is assumed to be a module
    # of the project that exposes the data directory as DIR_DATA.
    test_files = ('metacyc_html_entities.txt',
                  'metacyc_html_false_entities.txt')
    for filename in test_files:
        with open(cm.DIR_DATA + filename, 'r', encoding='utf-8') as f:
            # One entity per line; the set removes duplicates.
            entities = {line.rstrip('\n') for line in f}

        print(filename, ':\n', entities)



---

- `clean_name(identifier)`

Convert HTML entities in a string to their ASCII names.
`clean_name()` is a wrapper around `html_entities_to_names()`; it
also removes the leading `_` of identifiers that begin with a digit.


.. note:: For more examples & explanations, please take a look
at the doc of `html_entities_to_names()` function.

Example:

:::python
>>> clean_name('a &beta;-D-galactosyl-(1,4)')
'a beta-D-galactosyl-(1,4)'
>>> clean_name('galactosaminyl-&amp;alpha;1,3-')
'galactosaminyl-alpha1,3-'

---

- `html_entities_to_names(text)`

Convert HTML entities in a string to their ASCII names.

Example:

:::python
>>> html_entities_to_names('a &beta;-D-galactosyl-(1,4)')
'a beta-D-galactosyl-(1,4)'
>>> html_entities_to_names('galactosaminyl-&amp;alpha;1,3-')
'galactosaminyl-&alpha;1,3-'
>>> html_entities_to_names(html_entities_to_names('galactosaminyl-&amp;alpha;1,3-'))
'galactosaminyl-alpha1,3-'


.. note:: In IUPAC nomenclature, these prefixes are not supposed to be in upper case,
so we put them in lower case before the conversion.

.. note:: For some html entities, we have to correct their translation,
according to the chemical nomenclature.

Examples:

- 'pi': '(pros)' (at least one compound then contains several pairs of parentheses: CPD-1823)
- 'tau': '(tele)'
- 'harr': '<->'
- 'rarr': '->'
- 'plusmn': '+-' (should be an en dash)
- 'amp': '&'
- 'apos': "'"
- 'quot': '"'
- 'gt': '>'
- 'lt': '<'
- 'prime': "'"
- 'mdash': '-' (replace the em dash by a hyphen)
- 'ndash': '-' (replace the en dash by a hyphen)
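
The conversion described above can be sketched in Python 3 with `re` and the standard `html.entities.name2codepoint` table. The names `CHEMISTRY_NAMES` and `sketch_entities_to_names` are hypothetical, and the lookup order (overrides first, then the lower-cased entity name) is an assumption; the library's actual code may differ.

```python
import re
from html.entities import name2codepoint

# Overrides copied from the list above.
CHEMISTRY_NAMES = {
    'pi': '(pros)', 'tau': '(tele)', 'harr': '<->', 'rarr': '->',
    'plusmn': '+-', 'amp': '&', 'apos': "'", 'quot': '"', 'gt': '>',
    'lt': '<', 'prime': "'", 'mdash': '-', 'ndash': '-',
}


def sketch_entities_to_names(text):
    """Hypothetical re-implementation, for illustration only."""
    def replace(match):
        # Lower-case the entity name before the lookup.
        name = match.group(1).lower()
        if name in CHEMISTRY_NAMES:
            return CHEMISTRY_NAMES[name]
        if name in name2codepoint:
            return name
        # Unknown entity: leave it untouched.
        return match.group(0)
    return re.sub(r'&(\w+);', replace, text)


once = sketch_entities_to_names('galactosaminyl-&amp;alpha;1,3-')
print(once)
print(sketch_entities_to_names(once))
```

Because the substitution is a single pass, a doubly encoded entity such as `&amp;alpha;` needs two calls, as in the documented example.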


---

- `html_entities_to_utf8(text)`

Convert HTML entities in a string to their UTF-8 characters.

Example:

:::python
>>> html_entities_to_utf8('a &beta;-D-galactosyl-(1,4)')
'a β-D-galactosyl-(1,4)'
>>> html_entities_to_utf8('galactosaminyl-&amp;alpha;1,3-')
'galactosaminyl-&alpha;1,3-'
>>> html_entities_to_utf8(html_entities_to_utf8('galactosaminyl-&amp;alpha;1,3-'))
'galactosaminyl-α1,3-'

.. note:: In IUPAC nomenclature, these prefixes are not supposed to be in upper case,
so we put them in lower case before the conversion.
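
For comparison, the standard library's `html.unescape()` (linked above) behaves the same way on these inputs: one call decodes one level of encoding. The sketch below repeats the call until the text stops changing. The helper name `unescape_fully` and its `max_passes` parameter are assumptions for illustration, and the lower-casing mentioned in the note is not reproduced here.

```python
import html


def unescape_fully(text, max_passes=5):
    """Apply html.unescape() until the text no longer changes.

    Hypothetical helper; max_passes is an arbitrary safety bound.
    """
    for _ in range(max_passes):
        unescaped = html.unescape(text)
        if unescaped == text:
            break
        text = unescaped
    return text


print(html.unescape('galactosaminyl-&amp;alpha;1,3-'))
print(unescape_fully('galactosaminyl-&amp;alpha;1,3-'))
```

Looping until a fixed point avoids having to know in advance how many times a name was encoded.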

# Release history

- 0.0.1 (this version): `pylissIdConv-0.0.1.tar.gz` (13.4 kB), source distribution, uploaded Mar 29, 2017
