SPDX License List Match Guidelines

This document provides guidelines for matching licenses to licenses found on the SPDX License List. There is no intent here to make a judgment or interpretation, but merely to ensure that when one SPDX user identifies a license as “BSD 3-clause,” for example, it is indeed the same license as what someone else identifies as “BSD 3-clause” and the same license as what is listed on the SPDX License List.

1. Whitespace

1.1 Purpose: By having a rule regarding whitespace, we avoid the possibility of a non-match due to different spacing of words, line breaks, or paragraphs.

1.1.1 Guideline: All whitespace will be treated as a single blank space.
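
A minimal Python sketch of this guideline, not a reference implementation. The function name `collapse_whitespace` is hypothetical, and trimming leading and trailing whitespace is an added assumption that the guideline does not state.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, line breaks) with one space."""
    # Assumption: leading/trailing whitespace is also dropped before comparing.
    return re.sub(r"\s+", " ", text).strip()
```

Two texts that differ only in spacing or line breaks then compare equal, e.g. `collapse_whitespace("Redistribution  and\n\tuse")` and `collapse_whitespace("Redistribution and use")`.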

2. Capitalization

2.1 Purpose: By having a rule regarding capitalization, we avoid the possibility of a non-match due to lower case or upper case letters in otherwise the same words.

2.1.1 Guideline: All upper case and lower case letters will be treated as lower case letters.
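
In Python this guideline could be sketched as a single lower-casing step. The helper name `fold_case` is hypothetical; `str.lower()` is used because the guideline speaks of lower case letters, though `str.casefold()` would be a more aggressive alternative for non-English text.

```python
def fold_case(text: str) -> str:
    """Treat upper and lower case letters alike by lower-casing everything."""
    return text.lower()
```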

3. Punctuation

3.1 Purpose: Because punctuation can change the meaning of a sentence, punctuation needs to be included in the matching process

3.1.1 Guideline: Punctuation must be matched.

4. Bullets and Numbering

4.1 Purpose: By having a rule regarding bullets and numbering, we avoid the possibility of a non-match due to the otherwise same license using bullets instead of numbers for a list of clauses.

4.1.1 Guideline: Where a line starts with a bullet, numbering, or some form of a list item, ignore the list item.
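
One possible way to drop list markers is a regular expression applied to each line. The pattern below is an assumption about what "some form of a list item" covers (bullets `*`, `-`, `•`, and markers such as `1.`, `1)`, `(1)`, `a.`, `(a)`); the guideline itself does not enumerate them, and `strip_list_marker` is a hypothetical name.

```python
import re

# Assumed set of list markers; the marker must be followed by whitespace.
_LIST_MARKER = re.compile(r"^\s*(?:[*\-•]|\(?(?:\d+|[a-zA-Z])[.)])\s+")

def strip_list_marker(line: str) -> str:
    """Remove a leading bullet or number from a line, if one is present."""
    return _LIST_MARKER.sub("", line, count=1)
```

With this sketch, `"  1. Redistributions of source code"` and `"* Redistributions of source code"` both reduce to the same text, while a line that merely starts with a word is left alone.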

5. Varietal Word Spelling

5.1 Purpose: English uses different spelling for some words. By identifying the spelling variations for words found or likely to be found in licenses, we avoid the possibility of a non-match due to the same word being spelled differently. This list is not meant to be an exhaustive list of all spelling variations, but meant to capture the words most likely to be found in open source software licenses.

5.1.1 Guideline: The words in the following columns are considered equivalent and interchangeable:

No. Column 1           Column 2
1.  Acknowledgement    Acknowledgment
2.  Analog             Analogue
3.  Analyze            Analyse
4.  Artifact           Artefact
5.  Authorization      Authorisation
6.  Authorized         Authorised
7.  Caliber            Calibre
8.  Canceled           Cancelled
9.  Capitalizations    Capitalisations
10. Catalog            Catalogue
11. Categorize         Categorise
12. Center             Centre
13. Emphasized         Emphasised
14. Favor              Favour
15. Favorite           Favourite
16. Fulfill            Fulfil
17. Fulfillment        Fulfilment
18. Initialize         Initialise
19. Judgement          Judgment
20. Labeling           Labelling
21. Labor              Labour
22. License            Licence
23. Maximize           Maximise
24. Modeled            Modelled
25. Modeling           Modelling
26. Offense            Offence
27. Optimize           Optimise
28. Organization       Organisation
29. Organize           Organise
30. Practice           Practise
31. Program            Programme
32. Realize            Realise
33. Recognize          Recognise
34. Signaling          Signalling
35. Utilization        Utilisation
36. While              Whilst
37. Wilfull            Wilful
38. Noncommercial      Non-commercial
39. Percent            Per cent
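
The equivalence list could be applied by rewriting every Column 2 spelling to its Column 1 counterpart before comparing. The following is a sketch under stated assumptions: the names `EQUIVALENT_WORDS` and `canonical_spelling` are hypothetical, text is lower-cased first (per the capitalization guideline), only whole words are replaced, and inflected forms not in the list (e.g. "licenced") are left untouched.

```python
import re

# (column 1, column 2) pairs copied from the list above, lower-cased.
EQUIVALENT_WORDS = [
    ("acknowledgement", "acknowledgment"), ("analog", "analogue"),
    ("analyze", "analyse"), ("artifact", "artefact"),
    ("authorization", "authorisation"), ("authorized", "authorised"),
    ("caliber", "calibre"), ("canceled", "cancelled"),
    ("capitalizations", "capitalisations"), ("catalog", "catalogue"),
    ("categorize", "categorise"), ("center", "centre"),
    ("emphasized", "emphasised"), ("favor", "favour"),
    ("favorite", "favourite"), ("fulfill", "fulfil"),
    ("fulfillment", "fulfilment"), ("initialize", "initialise"),
    ("judgement", "judgment"), ("labeling", "labelling"),
    ("labor", "labour"), ("license", "licence"),
    ("maximize", "maximise"), ("modeled", "modelled"),
    ("modeling", "modelling"), ("offense", "offence"),
    ("optimize", "optimise"), ("organization", "organisation"),
    ("organize", "organise"), ("practice", "practise"),
    ("program", "programme"), ("realize", "realise"),
    ("recognize", "recognise"), ("signaling", "signalling"),
    ("utilization", "utilisation"), ("while", "whilst"),
    ("wilfull", "wilful"), ("noncommercial", "non-commercial"),
    ("percent", "per cent"),
]

_TO_COLUMN_1 = {col2: col1 for col1, col2 in EQUIVALENT_WORDS}

# Longest variants first so multi-word entries are tried before shorter ones.
_VARIANT = re.compile(
    r"\b(?:"
    + "|".join(re.escape(w) for w in sorted(_TO_COLUMN_1, key=len, reverse=True))
    + r")\b"
)

def canonical_spelling(text: str) -> str:
    """Lower-case the text and rewrite Column 2 spellings to Column 1."""
    return _VARIANT.sub(lambda m: _TO_COLUMN_1[m.group(0)], text.lower())
```

Choosing Column 1 as the canonical form is arbitrary; mapping in the other direction would work equally well as long as both texts are normalized the same way.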

6. Copyright Symbol

6.1 Purpose: By having a rule regarding the use of “©”, “(c)”, or “copyright,” we avoid the possibility of a mismatch based on these variations.

6.1.1 Guideline: “©”, “(c)”, or “Copyright” will be considered equivalent and interchangeable.
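
A small sketch of how the three forms might be mapped onto one token. The function name `normalize_copyright_mark` and the choice of "copyright" as the canonical token are assumptions; the sketch does not try to merge doubled forms such as "Copyright (c)", which the guideline does not address.

```python
import re

# Matches "©", "(c)" / "(C)", or the whole word "copyright" in any case.
_COPYRIGHT_MARK = re.compile(r"©|\(c\)|\bcopyright\b", re.IGNORECASE)

def normalize_copyright_mark(text: str) -> str:
    """Rewrite each copyright symbol variant to the single token 'copyright'."""
    return _COPYRIGHT_MARK.sub("copyright", text)
```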


7. Copyright Notice

7.1 Purpose: To avoid a license mismatch merely because the copyright notice (usually found above the actual license text) is different. The copyright notice is important information to be recorded elsewhere in the SPDX file, but for the purposes of matching a license to the SPDX License List, it should be ignored because it is not part of the substantive license text.

7.1.1 Guideline: Ignore copyright notices. A copyright notice typically looks like one of the following examples: "2012 Copyright, John Doe. All rights reserved." or "(c) 2012 John Doe."
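
Detecting a copyright notice is necessarily heuristic. The sketch below assumes a notice is a whole line that starts with an optional year followed by "Copyright", "©", or "(c)", and that also contains a four-digit year somewhere; both conditions are assumptions made for illustration, and `drop_copyright_notices` is a hypothetical name. Real notices vary far more than this pattern allows.

```python
import re

# Assumed shape: [year] (copyright | © | (c)) ..., with a year on the line.
_NOTICE = re.compile(
    r"^\s*(?=.*\b(?:19|20)\d{2}\b)(?:\d{4}\s+)?(?:copyright|©|\(c\))(?=[\s,])",
    re.IGNORECASE,
)

def drop_copyright_notices(text: str) -> str:
    """Remove lines that look like copyright notices; keep everything else."""
    kept = [line for line in text.splitlines() if not _NOTICE.match(line)]
    return "\n".join(kept)
```

Requiring a year keeps license clauses such as "Redistributions must retain the above copyright notice" from being dropped, at the cost of missing notices that carry no year.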

pick up here at 15 Feb 2012 Legal Workstream call

8. License Headers

8.1 Purpose: To identify standard headers that are used to indicate a particular license. These could be specific headers mandated by the license itself or common shorthand ways to indicate the license (e.g., “This file is licensed under GPL v2.” can then be matched to GPL v2). The specific headers still need to be identified and compiled into a list (TBD).

8.1.1 Guideline:

9. Verbatim Text

9.1 Purpose: To ensure that, when matching licenses to the SPDX License List, the substantive text of the license is the same, while disregarding parts of the text that do not alter its substance. A conservative approach is taken with regard to rules about replaceable text here.

9.1.1 Guideline: License text must be the same verbatim text (except for the rules stated here). The text must be in the same order, e.g., differently ordered paragraphs would not be considered a match.

9.1.2 Guideline: Certain oft-used licenses (e.g., BSD, Apache 1.1) have inline text that refers to the copyright holder or author generically. Sometimes the actual name of a company or individual is used, yet the rest of the license is exactly the same as a generic version. Where text is highlighted in the licenses below, that highlighted text can be ignored when determining a match. This rule only applies to the licenses and highlighted text identified below:

Handling of Non-ASCII characters

Do we have a guideline for how to process characters outside the 7-bit ASCII range? E.g., the German ü character appears in various different encodings depending on the locale used.

Suggested rule 1: We assume all texts are UTF-8 whenever there are no UTF-8 syntax errors. Otherwise, offending characters are stripped and replaced by a single replacement character (to be defined) so that correct UTF-8 results.

Suggested rule 2: All UTF-8 characters having a corresponding ASCII character are considered to be equal to that ASCII character (e.g., variations of quotes, hyphens, asterisks, bullets).
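
The two suggested rules could be sketched in Python as follows. This is only an illustration: `decode_utf8`, `to_ascii_equivalents`, and the contents of the mapping table are hypothetical, the replacement character is still "to be defined" in the proposal (Python's `errors="replace"` happens to use U+FFFD and may emit more than one for a longer malformed sequence), and which characters count as having "a corresponding ASCII character" is an open question.

```python
def decode_utf8(raw: bytes) -> str:
    """Rule 1 sketch: decode as UTF-8, substituting U+FFFD for invalid bytes."""
    return raw.decode("utf-8", errors="replace")

# Rule 2 sketch: an assumed, deliberately small table of look-alikes.
_ASCII_EQUIVALENTS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2010": "-", "\u2013": "-",   # hyphen, en dash
    "\u2014": "-",                  # em dash
})

def to_ascii_equivalents(text: str) -> str:
    """Map typographic quotes and dashes to their plain ASCII counterparts."""
    return text.translate(_ASCII_EQUIVALENTS)
```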

Re: Handling Of Non-ASCII Characters

These rules are being developed in terms of characters, rather than byte sequence. I'd prefer to stay away from implementation details such as character encoding for the matching rules.

I think we should add a rule stating that any sequences of characters whose glyphs appear the same when rendered should be considered equivalent for the purposes of matching.

This will handle the various ways combining characters such as umlauts can be represented in Unicode by allowing comparison of the strings in Normalization Form KC (NFKC). It is also general enough that people who dislike Unicode (and they do exist) can apply the rule to whatever encoding they prefer.
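
For readers working in Python, the comparison described here can be sketched with the standard `unicodedata` module. The helper name `same_after_nfkc` is hypothetical. Note that NFKC only approximates "glyphs appear the same": it unifies composed and decomposed forms and compatibility characters such as ligatures, but not look-alike letters from different scripts.

```python
import unicodedata

def same_after_nfkc(a: str, b: str) -> bool:
    """Compare two strings after converting both to Normalization Form KC."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
```

For example, a precomposed "ü" (U+00FC) and "u" followed by a combining diaeresis (U+0308) compare equal under this check.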

Combined Punctuation and whitespace

Suggested rule: Any punctuation that is not directly followed by whitespace is treated as if it were followed by whitespace.

The idea is to consider 'foo,bar or baz' equal to 'foo, bar or baz'.

Question: What exact definition should be used for punctuation? I'd default to ispunct() from the C locale, which matches any printable character that is not a space or an alphanumeric character.
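
The suggested rule might look like this in Python. The sketch assumes the ASCII definition of punctuation, using `string.punctuation`, which covers the same characters that ispunct() accepts in the C locale; the function name `space_after_punctuation` is hypothetical.

```python
import re
import string

# Any ASCII punctuation character not followed by whitespace or end of text.
_PUNCT_NO_SPACE = re.compile(
    "([" + re.escape(string.punctuation) + r"])(?!\s|$)"
)

def space_after_punctuation(text: str) -> str:
    """Insert a space after punctuation that is not already followed by one."""
    return _PUNCT_NO_SPACE.sub(r"\1 ", text)
```

The transformation also splits tokens such as "v2.0" into "v2. 0". That is harmless as long as both texts being compared go through the same step, but it means the output is meant for comparison only, not for display.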

What is punctuation

On the question of what counts as punctuation, perhaps we should use the General Punctuation list compiled by the Unicode specification.

Re: Bullets and Numbers

This rule could be extended to cover more cases:

a) The definition of 'a line starts with' should clearly say that any whitespace at the start of the line (e.g. indentation) is ignored.

b) Typical comment characters at the start of a line (e.g. /*, //, #, REM, ...) should also be ignored.

c) (Ad hoc idea:) Any punctuation at the start of a line is ignored.
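
Proposals a) and b) could be sketched together as one regular expression. The pattern covers only the comment markers named above (/*, //, #, REM); the "..." in the proposal is left open, and `strip_comment_prefix` is a hypothetical name. Proposal c) is not implemented here.

```python
import re

# Leading indentation, then one of the comment markers named in proposal b).
_COMMENT_PREFIX = re.compile(r"^\s*(?:/\*+|//+|#+|REM\b)\s*", re.IGNORECASE)

def strip_comment_prefix(line: str) -> str:
    """Drop indentation and a leading comment marker from a single line."""
    return _COMMENT_PREFIX.sub("", line, count=1)
```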

Preambles/epilogue

How should adding a preamble/epilogue to a license text affect matching?

hyphenation

We should probably point out that any word that is hyphenated to span lines should be considered equivalent to the unhyphenated word.

hyphenation

Suggested Rule 1: All hyphen, en dash, em dash, etc. characters are considered the same.

Suggested Rule 2: If a line ends in a hyphen, and there is a word directly in front of the hyphen and a word directly at the beginning of the next line, then the hyphen and whitespace are removed, so that both words are joined into one.

There is a slight risk that a compound word such as 'built-in' would lose its hyphen if that hyphen happens to fall on a line break. I'd say we can accept that risk.
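
Both suggested hyphenation rules can be sketched in a few lines of Python. The set of dash characters is an assumption (the proposal says "etc."), and the names `unify_dashes` and `join_hyphenated` are hypothetical.

```python
import re

# Assumed dash look-alikes: hyphen, non-breaking hyphen, figure dash,
# en dash, em dash, minus sign.
_DASHES = re.compile("[\u2010\u2011\u2012\u2013\u2014\u2212]")

# A hyphen at the end of a line with word characters on both sides.
_LINE_END_HYPHEN = re.compile(r"(?<=\w)-[ \t]*\r?\n[ \t]*(?=\w)")

def unify_dashes(text: str) -> str:
    """Rule 1 sketch: treat all dash-like characters as an ASCII hyphen."""
    return _DASHES.sub("-", text)

def join_hyphenated(text: str) -> str:
    """Rule 2 sketch: rejoin words that were hyphenated across a line break."""
    return _LINE_END_HYPHEN.sub("", unify_dashes(text))
```

As the comment above anticipates, `join_hyphenated("built-\nin")` yields "builtin", so a genuine compound split at its hyphen loses that hyphen.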