Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: 3626 add option enable license content; disable by default #3631

Merged
merged 15 commits into from
Feb 5, 2025

Conversation

spiffcs
Copy link
Contributor

@spiffcs spiffcs commented Jan 29, 2025

Description

With #3366 landing the other day we need a sensible way to toggle the new content option before the next syft release.

This PR adds a new top level syft configuration for licenses

license:
  # include the content of a license in the SBOM when syft
  # cannot determine a valid SPDX ID for the given license (env: SYFT_LICENSE_INCLUDE_UNKNOWN_LICENSE_CONTENT)
  include-unknown-license-content: false

  # adjust the percent as a fraction of the total text, in normalized words, that
  # matches any valid license for the given inputs, expressed as a percentage across all of the licenses matched. (env: SYFT_LICENSE_LICENSE_COVERAGE)
  license-coverage: 75

The new syft configuration keeps the current behavior where license content is disabled by default.
Content for unknown licenses to be included in the SBOM can be enabled on request

When this lands I plan on updating the wiki under configuration with examples:
https://github.com/anchore/syft/wiki/configuration

How to Validate

I've added a new CLI image that has a jar with an unknown license.

Reviewers can scan this jar using syft on this branch to see how contents can be optionally toggled on:

Content Enabled
SYFT_LICENSE_INCLUDE_UNKNOWN_LICENSE_CONTENT=true go run cmd/syft/main.go test/cli/test-fixtures/image-unknown-licenses/WowLibrary-1.0.0.jar -o json
 ✔ Indexed file system                                                     test/cli/test-fixtures/image-unknown-licenses/WowLibrary-1.0.0.jar
 ✔ Cataloged contents                                                        1de53b527bd3c2749dbbdb4fa6134adb531ca55cd83af14d4095c019063af569
   ├── ✔ Packages                        [1 packages]
   ├── ✔ File digests                    [1 files]
   ├── ✔ File metadata                   [1 locations]
   └── ✔ Executables                     [0 executables]
{
 "artifacts": [
  {
   "id": "c6849507c5393b1f",
   "name": "WowLibrary",
   "version": "1.0.0",
   "type": "java-archive",
   "foundBy": "java-archive-cataloger",
   "locations": [
    {
     "path": "/WowLibrary-1.0.0.jar",
     "accessPath": "/WowLibrary-1.0.0.jar",
     "annotations": {
      "evidence": "primary"
     }
    }
   ],
   "licenses": [
    {
     "value": "UNKNOWN",
     "spdxExpression": "UNKNOWN_53dd0aabc81f0bb155c021240a512bb1bf94f95a613c2d55566231ddbfef5dd3",
     "type": "declared",
     "urls": [],
     "locations": [
      {
       "path": "/WowLibrary-1.0.0.jar",
       "accessPath": "/WowLibrary-1.0.0.jar",
       "annotations": {
        "evidence": "primary"
       }
      }
     ],
     "contents": "Wow License v1.0\n\nThis is a custom license for the Wow Library.\nUsage is restricted as per the terms outlined in this file.\n"
    }
   ],

Type of change

  • New feature (non-breaking change which adds functionality)
  • Documentation (updates the documentation)

Checklist:

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

@spiffcs spiffcs force-pushed the 3626-disable-license-content branch from 95cf36a to b04732d Compare January 29, 2025 18:23
@spiffcs spiffcs marked this pull request as ready for review January 29, 2025 20:35
Signed-off-by: Christopher Phillips <[email protected]>
@spiffcs spiffcs requested a review from a team January 29, 2025 21:03
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
Comment on lines 74 to 78
licenseScanner := licenses.NewScanner(licensecheck.Scan, float64(75))
sc := &licenses.ScannerConfig{
CoverageThreshold: 75,
Scanner: licensecheck.Scan,
}
licenseScanner := licenses.NewScanner(sc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the scanner (and ScannerConfig) held the configuration for whether to include licenses or not? IdentifyLicenseIDs could just return nil if configured to exclude contents. This would avoid storing this configuration in the context, and the Search function just only includes contents if returned. I don't know if returning no license and nil contents would be a problem.

Copy link
Contributor Author

@spiffcs spiffcs Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two functions where we can make a decision on content are Search and IdentifyLicenseIDs

IdentifyLicenseIDs could just return nil if configured to exclude contents

I don't know if returning no license and nil contents would be a problem.

In short IdentifyLicenseIDs which is the method on scanner needs to return content in its current form:

// IdentifyLicenseIDs can only return a list of ID or content
// These return values are mutually exclusive.
// If the scanner threshold for matching scores < 75% then we return the license full content
if len(ids) > 0 {
for _, id := range ids {
lic := pkg.NewLicenseFromLocations(id, reader.Location)
lic.Type = license.Concluded
licenses = append(licenses, lic)
}
} else if len(content) > 0 {
// harmonize line endings to unix compatible first:
// 1. \r\n => \n (Windows => UNIX)
// 2. \r => \n (Macintosh => UNIX)
content = []byte(strings.ReplaceAll(strings.ReplaceAll(string(content), "\r\n", "\n"), "\r", "\n"))
lic := pkg.NewLicenseFromLocations(unknownLicenseType, reader.Location)
lic.SPDXExpression = UnknownLicensePrefix + getCustomLicenseContentHash(content)
if includeUnknownLicenseContent {
lic.Contents = string(content)
}
lic.Type = license.Declared
licenses = append(licenses, lic)
}

content is used to form the unknown unique ID when matches don't occur.
We're still creating a license with the location for the user to go investigate, we're just not including that files contents in the final SBOM that did not match the license match threshold.

There was an initial attempt to include the config on the scanner struct and not use context, but we need to return content to Search from IdentifyLicenseIDs no matter what.

This is because we're still creating a license for unknown licenses with locations and an ID that takes the sum of the content, just not including the entire text.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only hold up for me is using context to pass configuration in this way, since we're not using this same pattern elsewhere; everything else is 👍.

Outside of tests, IdentifyLicenseIDs is only used by license.Search, as far as I can tell, and since both of these are internal we can modify the functions/interfaces as much as we want. I don't think we need to change the Search signature at all (though I'm confused why it takes Context and Scanner; I was under the impression it was supposed to get the scanner out of the context or use a default, maybe the scanner is that default, but it's referenced without being checked for nil).

Why don't we just change the Scanner interface to instead of IdentifyLicenseIDs has IdentifyLicenses which returns the ([]license.License, error), matching Search, and this code just gets moved to the scanner, this allows the scanner to have all the appropriate configuration for what gets returned and do whatever the right thing is there.

* main:
  Add file catalogers to selection configuration (#3505)
  chore: replace all shorthand tags of mapstruct -> mapstructure (#3633)
  chore(deps): update tools to latest versions (#3637)
  chore(deps): update CPE dictionary index (#3638)
  feat: syft 3435 - add file components to cyclonedx bom output when file metadata is available (#3539)
  chore(deps): update tools to latest versions (#3635)
  chore(deps): bump github/codeql-action from 3.28.7 to 3.28.8 (#3634)
@spiffcs spiffcs force-pushed the 3626-disable-license-content branch from b495e75 to db41a2e Compare February 5, 2025 18:20
@spiffcs
Copy link
Contributor Author

spiffcs commented Feb 5, 2025

@kzantow I've updated this PR so we're now passing configuration to the scanner constructor at the initialization of the SBOM process and not mixing its config in context here.

The only ctx value we're using at the moment is to access the scanner.
This is currently done since it's shared across the different catalogers.

The Cover functionality is thread safe:
google/licensecheck#16

I'm open to maybe using another PR in the future to do a refactor that moves licenses.Scanner out of a context lookup and onto the config object of each cataloger that invokes it, but that would require a larger refactor of:
https://github.com/anchore/syft/blob/main/internal/task/package_tasks.go

The problem being that we only want one Scanner because of the cost of initialization for all of the default licenses it knows about:

func NewDefaultScanner() Scanner {
s, err := licensecheck.NewScanner(licensecheck.BuiltinLicenses())
if err != nil {
log.WithFields("error", err).Trace("unable to create default license scanner")
s = nil
}
return &scanner{
coverageThreshold: coverageThreshold,
scanner: s.Scan,
}
}

We only want licensecheck.BuiltinLicenses() to be called once and have a single Scanner for the whole process.

It's currently very difficult with the way the task package injects configuration into the different catalogers.

Take the go module catalog constructor as an example: https://github.com/anchore/syft/blob/684b6e3f9809a95ba08cb83c8d0661ab22eed8cc/internal/task/package_tasks.go#L82C3-L87

We have access to CatalogingFactoryConfig which is built by DefaultCatalogingFactoryConfig , but I'm not a fan of initializing the scanner on this struct and passing it around the catalogers since it's currently part of the public syft API. It's also called in multiple places and currently results in multiple scanners being initalized.

Copy link
Contributor

@kzantow kzantow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🎉

Signed-off-by: Christopher Phillips <[email protected]>
@spiffcs spiffcs enabled auto-merge (squash) February 5, 2025 20:29
@spiffcs spiffcs merged commit e584c9f into main Feb 5, 2025
12 checks passed
@spiffcs spiffcs deleted the 3626-disable-license-content branch February 5, 2025 20:41
spiffcs added a commit that referenced this pull request Feb 7, 2025
…suport-package-supplier

* 'main' of https://github.com/anchore/syft:
  chore(deps): bump github/codeql-action from 3.28.8 to 3.28.9 (#3648)
  feat: 3626 add option enable license content; disable by default (#3631)
juan131 pushed a commit to juan131/syft that referenced this pull request Feb 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Configuration for including license contents in SBOM
2 participants