In the current design, we address the project’s technical requirements and implementation questions.
Taxonomy technical requirements
- Taxonomy format:
- Compatible with a taxonomy editing/visualisation tool
- Schema:
- data schema (normalized):
- Alias titles + attributes
- Canonical titles + attributes
- Alias to canonical title relation
- Child to parent title relation
- Alias to company relation
- Canonical titles to function and function ISO-8K classifications
- schema format: one of the standards, depending on the chosen taxonomy editor tool
- Version control and storage:
- Access permissions management (write / read)
- Version tagging
- Version description
- Testing:
- Data integrity and semantic compliance
- A labelled sample of job descriptions for quality evaluation and regression control
- Additional data:
- List of hiring companies
- Documentation
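As a rough illustration of the normalized data schema listed above, the entities and relations could be modelled as below. This is a sketch only: every class, field, and function name (`CanonicalTitle`, `Alias`, `canonical_id`, `parent_id`, `resolve`, `ancestors`) is a hypothetical placeholder, and the real attribute set and schema format depend on the chosen taxonomy editor tool.

```python
from dataclasses import dataclass
from typing import Optional

# All names below are illustrative assumptions, not the agreed schema.


@dataclass(frozen=True)
class CanonicalTitle:
    canonical_id: str                 # unique identifier
    title: str                        # e.g. "Data Engineer"
    function: Optional[str] = None    # canonical-title-to-function relation
    parent_id: Optional[str] = None   # child-to-parent title relation


@dataclass(frozen=True)
class Alias:
    alias_id: str                     # unique identifier
    title: str                        # e.g. "data eng"
    canonical_id: str                 # alias-to-canonical-title relation
    company: Optional[str] = None     # alias-to-company relation


def resolve(alias: Alias, canonicals: dict) -> CanonicalTitle:
    """Follow the alias-to-canonical relation (canonicals is keyed by id)."""
    return canonicals[alias.canonical_id]


def ancestors(title: CanonicalTitle, canonicals: dict) -> list:
    """Walk child-to-parent relations upward; stops if a cycle is detected."""
    chain, seen, current = [], set(), title
    while current.parent_id is not None and current.parent_id not in seen:
        seen.add(current.parent_id)
        current = canonicals[current.parent_id]
        chain.append(current)
    return chain
```

Because each alias carries exactly one `canonical_id`, this shape has the same single-mapping limitation as a flat table; a many-to-many model would need a separate relation record.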
Implementation and Research
- What taxonomy editing & visualisation tools are available/standard?
- RESEARCH:
- Options:
- Protege
- PoolParty (community license if applicable)
- …?
- Questions:
- What data integrity and semantic compliance can it enforce?
- Can it assign taxonomy entities unique identification?
- Visualisation?
- Ease of use?
- Taxonomy and schema formats required?
- DECISIONS:
- Data integrity and semantic validation:
- Nimrod will start working on a Python script to implement the validation rules
- Some of the data integrity validations will be implemented with Google Sheets built-in validations, including colour coding
- We decided not to pursue further custom-function scripting inside Google Sheets, as it is too complicated to create and maintain; external Python code seems to be a better fit
- Data storage:
- The output format will be CSV, which is the version our users can consume
- We will also store the full Google Sheets tab in git, in order to preserve the built-in validations
- Data dictionary: will list aliases and canonical titles, with their identifiers
- Choose documentation platform
- Version control: the community’s GitHub
- Versioned data:
- Is the stored format the same as the editable format?
- Can it be easily queried/used by external applications?
- Can one version be easily compared to the previous one?
- Can testing be integrated/enforced as part of a version delivery process?
Editor tool requirements
Taxonomy and taxonomy scheme formats
- Taxonomy format:
- CSV
- Pros:
- few changes to the existing format
- easy to manipulate with code
- may be a good compromise for v1
- Cons:
- does not encode semantic logic, which will need to be validated externally
- representing many-to-many relations in this format becomes too complex to implement (e.g. an ambiguous alias will be mapped to a single canonical title)
- if we later adopt a more complex model format, we will need to migrate all users to the new format
- RDF
- Pros: encodes semantic logic; standard for taxonomies
- Cons: requires dedicated taxonomy editor tools
- Taxonomy scheme:
- Options:
- SKOS (standard)
- Protobuf (not standard for taxonomies)
- Decision:
Validation rules
- Data integrity control:
- Type enforcement, including:
- Classification labels come from a closed list of values.
- People manager attribute is boolean.
- etc.
- Deduplication – no two rows are exactly the same.
- Trimming: strings have no leading or trailing spaces
- Multiple spaces between tokens are forbidden
- Casing: aliases are in lowercase, canonical titles are in title case (all words are capitalized)
- Special characters: only alphanumerics are allowed, apart from a few declared exceptions (/’-); the definition attribute is free text and may include punctuation.
- Deletions:
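The string-level integrity rules above (trimming, spacing, casing, special characters, deduplication) could be checked with a small Python script along these lines. This is a minimal sketch, not the agreed implementation: the function names `check_string` and `find_duplicates` are hypothetical, and the allowed character set assumes ASCII alphanumerics plus the declared exceptions, with both the straight and the typographic apostrophe accepted.

```python
import re

# Assumption: ASCII alphanumerics, single spaces between tokens, and the
# declared exceptions "/", "-" and the apostrophe (straight or typographic).
ALLOWED = re.compile(r"[A-Za-z0-9/'’\- ]*")


def check_string(value: str, kind: str) -> list:
    """Return rule violations for one title; kind is "alias" or "canonical"."""
    errors = []
    if value != value.strip():
        errors.append("leading or trailing spaces")
    if "  " in value:
        errors.append("multiple spaces between tokens")
    if not ALLOWED.fullmatch(value):
        errors.append("disallowed special character")
    if kind == "alias" and value != value.lower():
        errors.append("alias is not lowercase")
    if kind == "canonical":
        # Title case here means every word starts with a capital letter.
        if any(w[0].isalpha() and not w[0].isupper() for w in value.split()):
            errors.append("canonical title is not in title case")
    return errors


def find_duplicates(rows: list) -> list:
    """Return rows (as tuples) that appear more than once, in first-seen order."""
    seen, dupes = set(), []
    for row in rows:
        if row in seen and row not in dupes:
            dupes.append(row)
        seen.add(row)
    return dupes
```

The free-text definition attribute would be excluded from the special-character check, and closed-list or boolean type checks could follow the same pattern of returning a list of violations per row.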
Unique identifiers for aliases and for canonical titles (can be handled in a separate worksheet)
- Aliases and canonical_title in the dictionary exist in the taxonomy as well
- alias_company exists in the company data in the canonical_company field
- All entries in the dictionary are unique
- Mandatory fields are filled in
- Semantic logic control requirements:
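The identifier checks listed above (dictionary entries exist in the taxonomy, alias_company exists in the company data, entries are unique, mandatory fields are filled in) could be sketched as follows; the semantic logic rules are not covered. The sketch assumes rows are loaded as dictionaries, for example with `csv.DictReader`. The field names `alias`, `canonical_title`, `alias_company` and `canonical_company` come from the rules above, while `id`, `kind`, `title`, `MANDATORY` and `check_dictionary` are hypothetical names chosen for illustration.

```python
MANDATORY = ("id", "kind", "title")  # assumed mandatory dictionary fields


def check_dictionary(dictionary_rows, taxonomy_rows, company_rows):
    """Return a list of human-readable violations (empty if all checks pass)."""
    errors = []
    aliases = {r["alias"] for r in taxonomy_rows}
    canonicals = {r["canonical_title"] for r in taxonomy_rows}
    known_companies = {r["canonical_company"] for r in company_rows}

    seen = set()
    for n, row in enumerate(dictionary_rows, start=1):
        # Mandatory fields are filled in.
        for field in MANDATORY:
            if not (row.get(field) or "").strip():
                errors.append(f"dictionary row {n}: missing {field}")
        # All entries in the dictionary are unique.
        key = (row.get("kind"), row.get("title"))
        if key in seen:
            errors.append(f"dictionary row {n}: duplicate entry {key}")
        seen.add(key)
        # Aliases and canonical titles exist in the taxonomy as well.
        pool = aliases if row.get("kind") == "alias" else canonicals
        if row.get("title") not in pool:
            errors.append(f"dictionary row {n}: {row.get('title')!r} not in taxonomy")

    # alias_company exists in the company data (canonical_company field).
    for n, row in enumerate(taxonomy_rows, start=1):
        company = row.get("alias_company")
        if company and company not in known_companies:
            errors.append(f"taxonomy row {n}: unknown company {company!r}")
    return errors
```

Returning a list of messages rather than raising on the first failure lets a single run report every problem in a version before it is tagged.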
Access permissions