<aside>
💡
Because of time and resource constraints and the small size of the data, validation in this version is semi-automatic. In future versions, the process will be automated and will run validations against dictionary.csv (data integrity) and master.csv (integrity of semantic relations).
</aside>
Validation rules
- Taxonomy data integrity control:
- Type enforcement, including:
- Classification labels come from a closed list of values.
- People manager attribute is boolean.
- etc.
- Deduplication – no two rows are exactly the same.
- Trimming - strings have no leading or trailing spaces
- Multiple spaces between tokens are forbidden
- Casing: aliases are in lowercase, canonical titles are in title case (all words are capitalized)
- Special characters: only alphanumerics are allowed, plus a few declared exceptions (/’-). The definition attribute is exempt: it is free text and can include punctuation.
- Deletions:
- Full entry removal is forbidden
- Deactivation is allowed by setting the is_active field to FALSE
Unique identifiers for aliases and for canonical titles (can be handled in a separate worksheet)
- Aliases and canonical_title in the dictionary exist in the taxonomy as well
- alias_company exists in the company data in the canonical_company field
- All entries in the dictionary are unique
- Mandatory fields are filled in
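The string-level rules above (trimming, spacing, casing, special characters, deduplication) could be automated along the following lines. This is a minimal sketch, not the validation actually used for this version: the function names, the `kind` argument, and the ASCII-only allow-list are assumptions made for illustration.

```python
import re

# Assumed allow-list: ASCII alphanumerics, single spaces, and the declared
# exceptions / ’ - (the free-text definition attribute is not checked here).
ALLOWED = re.compile(r"^[A-Za-z0-9 /’\-]*$")


def check_string(value: str, kind: str) -> list[str]:
    """Return the rule violations for one alias or canonical title.

    `kind` is a hypothetical switch: "alias" or "canonical_title".
    """
    problems = []
    if value != value.strip():
        problems.append("leading/trailing space")
    if "  " in value:
        problems.append("multiple spaces")
    if not ALLOWED.match(value):
        problems.append("special character")
    if kind == "alias" and value != value.lower():
        problems.append("alias not lowercase")
    if kind == "canonical_title":
        # Title case here means every word starts with an uppercase letter.
        if any(w[0].isalpha() and not w[0].isupper() for w in value.split()):
            problems.append("canonical title not title case")
    return problems


def find_duplicate_rows(rows: list[dict]) -> list[dict]:
    """Return every row that exactly repeats an earlier row."""
    seen, dupes = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            dupes.append(row)
        seen.add(key)
    return dupes
```

Returning a list of violations per value, rather than stopping at the first one, keeps the output close to what a manual review in the spreadsheet would record.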
Done (using Google spreadsheet validation dropdown menus in the taxonomy)
Done (no duplicate alias found in the taxonomy)
Done (in dictionary with Google spreadsheet function)
Done (in dictionary, using Find regex)
Done (in dictionary)
Done
Irrelevant (v1 as the baseline)
Done (vlookup)
Done (vlookup)
Done
Done
Done (136 unique aliases)
Done (pivot table)
OK
Done (vlookup)
OK
Done (pivot # of unique parent maps to canonical title)
Done (using Gemini)
Done
Done (split labels by “|”, then vlookup)
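The vlookup and label-splitting checks reported above map naturally onto set lookups. The sketch below is one possible way to automate them, assuming the rows are loaded as dictionaries (for example with `csv.DictReader`). The helper names and the `labels` column name are hypothetical; `alias` and `canonical_title` follow the field names used in the rules.

```python
def missing_references(rows: list[dict], column: str, reference_values) -> list[str]:
    """vlookup-style check: values of `column` that are absent from the reference set."""
    return sorted({row[column] for row in rows} - set(reference_values))


def invalid_labels(rows: list[dict], column: str, allowed, sep: str = "|") -> list[str]:
    """Split each cell on `sep` and return the labels outside the closed list."""
    bad = set()
    for row in rows:
        cell = row.get(column) or ""
        for label in cell.split(sep):
            label = label.strip()
            if label and label not in allowed:
                bad.add(label)
    return sorted(bad)


def empty_mandatory(rows: list[dict], mandatory: list[str]) -> list[tuple[int, str]]:
    """Return (row index, column) pairs where a mandatory field is blank."""
    return [
        (i, col)
        for i, row in enumerate(rows)
        for col in mandatory
        if not (row.get(col) or "").strip()
    ]


# Illustrative data only; these values are not taken from the real files.
dictionary = [
    {"alias": "swe", "canonical_title": "Software Engineer", "labels": "Engineering|IC"},
    {"alias": "pm", "canonical_title": "Product Manager", "labels": "Product|Unknown"},
]
taxonomy_titles = {"Software Engineer"}
print(missing_references(dictionary, "canonical_title", taxonomy_titles))
print(invalid_labels(dictionary, "labels", {"Engineering", "IC", "Product"}))
```

Each helper returns the offending values instead of a pass/fail flag, so an empty result corresponds to a "Done" or "OK" status in the list above.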