SPDX License List Match Guidelines
This document provides guidelines for matching licenses to licenses found on the SPDX License List. There is no intent here to make a judgment or interpretation, but merely to ensure that when one SPDX user identifies a license as “BSD 3-clause,” for example, it is indeed the same license as what someone else identifies as “BSD 3-clause” and the same license as what is listed on the SPDX License List.
1. Whitespace
1.1 Purpose: By having a rule regarding whitespace, we avoid the possibility of a non-match due to different spacing of words, line breaks, or paragraphs
1.1.1 Guideline: All whitespace will be treated as a single blank space
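A minimal sketch of how the whitespace guideline could be applied before comparing two texts. The function name `collapse_whitespace` is hypothetical, and trimming leading and trailing whitespace is an assumption that the guideline itself does not state.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Treat every run of whitespace as a single blank space."""
    # \s matches spaces, tabs, and line breaks; each run becomes one space.
    # Trimming the ends is an assumption, not part of the guideline.
    return re.sub(r"\s+", " ", text).strip()
```

With this, two copies of a license that differ only in line wrapping or paragraph spacing compare equal after normalization.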
2. Capitalization
2.1 Purpose: By having a rule regarding capitalization, we avoid the possibility of a non-match due to lower case or upper case letters in otherwise the same words
2.1.1 Guideline: All upper case and lower case letters will be treated as lower case letters
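As an illustration only, the capitalization guideline amounts to lower-casing both texts before comparison. The helper name `same_ignoring_case` is hypothetical.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two texts with all letters treated as lower case."""
    return a.lower() == b.lower()
```

Note that this handles capitalization only; spelling differences such as "license" and "licence" are a separate rule.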
3. Punctuation
3.1 Purpose: Because punctuation can change the meaning of a sentence, punctuation needs to be included in the matching process
3.1.1 Guideline: Punctuation must be matched.
4. Bullets and Numbering
4.1 Purpose: By having a rule regarding bullets and numbering, we avoid the possibility of a non-match due to the otherwise same license using bullets instead of numbers for a list of clauses.
4.1.1 Guideline: Where a line starts with a bullet, numbering, or some form of a list item, ignore the list item.
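The guideline does not define what counts as "some form of a list item", so the following is only a sketch under an assumed definition: bullet characters, numbers, roman numerals, or single letters followed by "." or ")" (optionally wrapped in parentheses). The pattern `LIST_MARKER` and the function `strip_list_marker` are hypothetical names.

```python
import re

# Assumed set of list markers; the guideline does not enumerate them.
LIST_MARKER = re.compile(
    r"""^\s*(?:
          [*\u2022-]                 # bullet characters: *, the bullet dot, -
        | \(?\d+(?:\.\d+)*[.)]       # numbers: 1.  1)  (1)  1.1.
        | \(?[ivxlcdm]+[.)]          # roman numerals: iv.  (ii)
        | \(?[a-z][.)]               # single letters: a.  b)  (c)
    )\s+""",
    re.VERBOSE | re.IGNORECASE,
)


def strip_list_marker(line: str) -> str:
    """Drop a leading list marker from one line, if one is present."""
    return LIST_MARKER.sub("", line, count=1)
```

One caveat with this assumed pattern: a line that begins with "(c)" is ambiguous, since it could be a list item or a copyright symbol, so an implementation would need to decide which rule takes precedence.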
5. Varietal Word Spelling
5.1 Purpose: English uses different spelling for some words. By identifying the spelling variations for words found or likely to be found in licenses, we avoid the possibility of a non-match due to the same word being spelled differently. This list is not meant to be an exhaustive list of all spelling variations, but meant to capture the words most likely to be found in open source software licenses.
5.1.1 Guideline: The words in the following columns are considered equivalent and interchangeable:
| Column 1 | Column 2 |
| Acknowledgement | Acknowledgment |
| Analog | Analogue |
| Analyze | Analyse |
| Artifact | Artefact |
| Authorization | Authorisation |
| Authorized | Authorised |
| Caliber | Calibre |
| Canceled | Cancelled |
| Capitalizations | Capitalisations |
| Catalog | Catalogue |
| Categorize | Categorise |
| Center | Centre |
| Emphasized | Emphasised |
| Favor | Favour |
| Favorite | Favourite |
| Fulfill | Fulfil |
| Fulfillment | Fulfilment |
| Initialize | Initialise |
| Judgement | Judgment |
| Labeling | Labelling |
| Labor | Labour |
| License | Licence |
| Maximize | Maximise |
| Modeled | Modelled |
| Modeling | Modelling |
| Offense | Offence |
| Optimize | Optimise |
| Organization | Organisation |
| Organize | Organise |
| Practice | Practise |
| Program | Programme |
| Realize | Realise |
| Recognize | Recognise |
| Signaling | Signalling |
| Utilization | Utilisation |
| While | Whilst |
| Wilfull | Wilful |
| Noncommercial | Non-commercial |
| Percent | Per cent |
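One possible way to apply the equivalence table is to rewrite every Column 2 spelling to its Column 1 counterpart before comparing. The sketch below uses only a handful of pairs from the table; the names `EQUIVALENT` and `canonical_spelling` are hypothetical, and it assumes whitespace has already been collapsed so that "per cent" contains a single space.

```python
import re

# A few pairs from the table above (Column 2 -> Column 1), lower-cased.
EQUIVALENT = {
    "acknowledgment": "acknowledgement",
    "analogue": "analog",
    "licence": "license",
    "organisation": "organization",
    "whilst": "while",
    "non-commercial": "noncommercial",
    "per cent": "percent",
}

# Longer entries first so that multi-word entries are tried before shorter ones.
_WORDS = sorted(EQUIVALENT, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(w) for w in _WORDS) + r")\b")


def canonical_spelling(text: str) -> str:
    """Lower-case the text and map listed spelling variants to one form."""
    return _PATTERN.sub(lambda m: EQUIVALENT[m.group(1)], text.lower())
```

Because the match is on whole words, inflected forms that are not in the table (for example "licences") are left unchanged in this sketch.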
6. Copyright Symbol
6.1 Purpose: By having a rule regarding the use of “©”, “(c)”, or “copyright,” we avoid the possibility of a mismatch based on these variations.
6.1.1 Guideline: “©”, “(c)”, or “Copyright” will be considered equivalent and interchangeable
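A small sketch of the copyright-symbol equivalence, assuming the three forms are rewritten to one canonical token before comparison. The names `COPYRIGHT_TOKEN` and `canonical_copyright` are hypothetical.

```python
import re

# The three interchangeable forms named in the guideline.
COPYRIGHT_TOKEN = re.compile(r"©|\(c\)|\bcopyright\b", re.IGNORECASE)


def canonical_copyright(text: str) -> str:
    """Rewrite "©", "(c)", and "Copyright" to the single token "copyright"."""
    return COPYRIGHT_TOKEN.sub("copyright", text)
```

This treats each form as the same token but does not collapse doubled forms such as "Copyright (c)", which the guideline does not address.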
7. Copyright Notice
7.1 Purpose: To avoid a license mismatch merely because the copyright notice (usually found above the actual license text) is different. The copyright notice is important information to be recorded elsewhere in the SPDX file, but for the purposes of matching a license to the SPDX License List, it should be ignored because it is not part of the substantive license text.
7.1.1 Guideline: Ignore copyright notices. A copyright notice consists of the following elements, for example: "2012 Copyright, John Doe. All rights reserved." or "(c) 2012 John Doe."
pick up here at 15 Feb 2012 Legal Workstream call
8. License Headers
8.1 Purpose: To identify standard headers that are used to indicate a particular license. These could be specific headers mandated by the license itself or common shorthand ways to indicate the license (e.g., “This file is licensed under GPL v2.” can then be matched to GPL v2). We still need to identify what these might be and come up with a list. TBD
8.1.1 Guideline:
9. Verbatim Text
9.1 Purpose: To ensure that, when matching licenses to the SPDX License List, the substantive text of the license is the same, while disregarding parts of the text that do not alter its substance. A conservative approach is taken with regard to rules about replaceable text here.
9.1.1 Guideline: License text must be the same verbatim text (except for the rules stated here). The text must be in the same order, e.g., differently ordered paragraphs would not be considered a match.
9.1.2 Guideline: Certain oft-used licenses (e.g., BSD, Apache 1.1) have inline text that refers to the copyright holder or author generically. Sometimes the actual name of a company or individual is used, yet the rest of the license is exactly the same as a generic version. Where text is highlighted in the licenses below, that highlighted text can be ignored when determining a match. This rule only applies to the licenses and highlighted text identified below:


Handling of Non-ASCII characters
Do we have a guideline for how to process characters outside the 7-bit ASCII range? For example, the German ü character appears in various encodings depending on the locale used.
Suggested rule 1: We assume all texts are UTF-8 whenever there are no UTF-8 syntax errors. Otherwise, offending characters are stripped and replaced by a single replacement character (to be defined) so that correct UTF-8 results.
Suggested rule 2: All UTF-8 characters that have a corresponding ASCII character are considered equal to that ASCII character (e.g., variations of quotes, hyphens, asterisks, bullets).
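The two suggested rules could be sketched as follows. The replacement character is left "to be defined" in the proposal; this sketch assumes U+FFFD, which is what Python's `errors="replace"` decoding produces. The mapping table is a small hypothetical sample, and mapping bullets to "*" is an assumption.

```python
# Hypothetical sample of non-ASCII characters and their assumed ASCII equivalents.
ASCII_EQUIVALENTS = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2010": "-",   # hyphen
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2022": "*",   # bullet (assumed to map to an asterisk)
})


def decode_license_bytes(raw: bytes) -> str:
    """Decode as UTF-8, replacing invalid bytes, then fold to ASCII look-alikes."""
    # Rule 1: invalid byte sequences become U+FFFD (assumed replacement character).
    text = raw.decode("utf-8", errors="replace")
    # Rule 2: characters with an ASCII counterpart are treated as that character.
    return text.translate(ASCII_EQUIVALENTS)
```

For instance, a Latin-1 encoded "ü" (the single byte 0xFC) is not valid UTF-8, so it comes out as the replacement character rather than as "ü".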
Re: Handling Of Non-ASCII Characters
These rules are being developed in terms of characters, rather than byte sequence. I'd prefer to stay away from implementation details such as character encoding for the matching rules.
I think we should add a rule stating that any sequences of characters whose glyphs appear the same when rendered should be considered equivalent for the purposes of matching.
This will handle the various ways combining characters such as umlauts can be represented in Unicode by allowing comparison of the normalized form KC (NFKC) of the strings. It is also general enough that people who dislike Unicode (who do exist) can apply the rule to whatever encoding they prefer.
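For Unicode text, the comparison of normalized form KC described here could look like the sketch below, using Python's standard `unicodedata` module. The helper name `same_after_nfkc` is hypothetical.

```python
import unicodedata


def same_after_nfkc(a: str, b: str) -> bool:
    """Compare two strings after Unicode normalization to form KC."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
```

A "u" followed by a combining diaeresis then compares equal to the precomposed "ü". NFKC is only an approximation of "glyphs appear the same": visually similar characters from different scripts (for example Latin "a" and Cyrillic "а") remain distinct.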
Combined Punctuation and whitespace
Suggested rule: Any punctuation that is not directly followed by whitespace is treated as if it were followed by whitespace.
The idea is to consider 'foo,bar or baz' equal to 'foo, bar or baz'.
Question: What is the exact definition to be used for punctuation? I'd default to ispunct() in the C locale, which accepts any printable character that is not a space or an alphanumeric character.
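A sketch of the suggested rule, assuming the ASCII punctuation set from Python's `string.punctuation`, which covers the same 32 characters that `ispunct()` accepts in the C locale. The function name `space_after_punctuation` is hypothetical.

```python
import re
import string

# Character class built from the 32 ASCII punctuation characters.
_PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"
# A punctuation character that is directly followed by a non-whitespace character.
_PUNCT_NO_SPACE = re.compile("(" + _PUNCT_CLASS + r")(?=\S)")


def space_after_punctuation(text: str) -> str:
    """Insert a space after punctuation that is not followed by whitespace."""
    return _PUNCT_NO_SPACE.sub(r"\1 ", text)
```

The rule is applied to both texts before comparison, so side effects such as "v2.0" becoming "v2. 0" occur on both sides and do not by themselves cause a mismatch.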
What is punctuation
On the question of what is punctuation, perhaps we should use the General Punctuation list compiled in the Unicode specification.
Re: Bullets and Numbers
This rule could be extended to cover more cases:
a) The definition of 'a line starts with' should clearly say that any whitespace at the start of the line (e.g., indentation) is ignored.
b) Typical comment characters at the start of a line (e.g., /*, //, #, REM, ...) should also be ignored.
c) (Ad hoc idea:) Any punctuation at the start of a line is ignored.
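Proposals a) and b) could be sketched as a single line-prefix rule. The set of comment markers below is limited to the four named in the proposal, and the names `COMMENT_PREFIX` and `strip_line_prefix` are hypothetical.

```python
import re

# Leading indentation, then an optional comment marker (/*, //, #, REM),
# then any whitespace that follows the marker.
COMMENT_PREFIX = re.compile(r"^\s*(?:/\*+|//+|#+|REM\b)?\s*", re.IGNORECASE)


def strip_line_prefix(line: str) -> str:
    """Remove indentation and a leading comment marker from one line."""
    return COMMENT_PREFIX.sub("", line, count=1)
```

The list-item rule would then be applied to what remains of the line, so that a bullet inside a source-code comment is still recognized.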
Preambles/epilogue
How should adding a preamble or epilogue to a license text affect matching?
hyphenation
We should probably point out that any word that is hyphenated to span lines should be considered equivalent to the unhyphenated word.
hyphenation
Suggested Rule 1: All hyphen, en dash, em dash, etc. characters are considered the same.
Suggested Rule 2: If a line ends in a hyphen, and there is a word directly in front of the hyphen and a word directly at the beginning of the next line, then the hyphen and whitespace are removed, so that both words are joined into one.
There is a slight risk that a compound word such as 'built-in' would lose its hyphen if that hyphen happens to fall at a line break. I'd say we can accept that risk.
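The two suggested hyphenation rules could be sketched as below. Which characters count under "etc." is not specified, so the set in `DASHES` (U+2010 through U+2015) is an assumption; the names `DASHES`, `LINE_END_HYPHEN`, and `normalize_hyphens` are hypothetical. This step would have to run before whitespace is collapsed, since it depends on line breaks.

```python
import re

# Assumed dash-like characters: hyphen, non-breaking hyphen, figure dash,
# en dash, em dash, horizontal bar.
DASHES = re.compile("[\u2010\u2011\u2012\u2013\u2014\u2015]")
# A hyphen between a word character and a line break, with the next line
# (after optional indentation) starting with a word character.
LINE_END_HYPHEN = re.compile(r"(?<=\w)-[ \t]*\r?\n[ \t]*(?=\w)")


def normalize_hyphens(text: str) -> str:
    """Unify dash characters, then rejoin words hyphenated across lines."""
    text = DASHES.sub("-", text)          # Rule 1: all dashes become "-"
    return LINE_END_HYPHEN.sub("", text)  # Rule 2: join the split word
```

As noted above, a genuine compound broken at its hyphen, such as "built-" followed by "in" on the next line, comes out as "builtin" under this rule.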