Previously, we were assuming that `which <python>` return the path to
the python executable. This is not true when using pyenv shims, which
are bash scripts. Instead, we have to use `sys.executable`. Luckily,
we're already querying the python interpreter and can do it in that
pass.
We are also not allowed to cache the execution of the python interpreter
through the shim because pyenv might change the target. As a heuristic,
we check whether `sys.executable`, the real binary, is the same our
canonicalized `which` result.
---------
Co-authored-by: Zanie Blue <contact@zanie.dev>
## Summary
This PR implements logic to sort wheels by priority, where priority is
defined as preferring more "specific" wheels over less "specific"
wheels. For example, in the case of Black, my machine now selects
`black-23.11.0-cp311-cp311-macosx_11_0_arm64.whl`, whereas sorting by
lowest priority instead gives me `black-23.11.0-py3-none-any.whl`.
As part of this change, I've also modified the resolver to fallback to
using incompatible wheels when determining package metadata, if no
compatible wheels are available.
The `VersionMap` was also moved out of `resolver.rs` and into its own
file with a wrapper type, for clarity.
Closes https://github.com/astral-sh/puffin/issues/380.
Closes https://github.com/astral-sh/puffin/issues/421.
The pip compile test now explicitly set their python version and `puffin
venv` resolves e.g. `python3.12` correctly now. The venv creation is
moved to a shared method
Implement two behaviors for yanked versions:
* During `pip-compile`, yanked versions are filtered out entirely, we
currently treat them is if they don't exist. This is leads to confusing
error messages because a version that does exist seems to have suddenly
disappeared.
* During `pip-sync`, we warn when we fetch a remote distribution and it
has been yanked. We currently don't warn on cached or installed
distributions that have been yanked.
Filter out source dists and wheels whose `requires-python` from the
simple api is incompatible with the current python version.
This change showed an important problem: When we use a fake python
version for resolving, building source distributions breaks down because
we can only build with versions we actually have.
This change became surprisingly big. The tests now require python 3.7 to
be installed, but changing that would mean an even bigger change.
Fixes#388
## Summary
Closes https://github.com/astral-sh/puffin/issues/390.
## Test Plan
Installed `jupyter_core==5.5.0`, then removed the `jupyter_core` and
`jupyter_core-5.5.0.dist-info` directories from my virtualenv manually,
but left `jupyter.py`. I then re-ran `puffin pip-compile`, and verified
that it errored on `main` but succeeded here.
This copies the allocator configuration used in the Ruff project. In
particular, this gives us an instant 10% win when resolving the top 1K
PyPI packages:
$ hyperfine \
"./target/profiling/puffin-dev-main resolve-many --cache-dir
cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2>
/dev/null" \
"./target/profiling/puffin-dev resolve-many --cache-dir
cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2>
/dev/null"
Benchmark 1: ./target/profiling/puffin-dev-main resolve-many --cache-dir
cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2>
/dev/null
Time (mean ± σ): 974.2 ms ± 26.4 ms [User: 17503.3 ms, System: 2205.3
ms]
Range (min … max): 943.5 ms … 1015.9 ms 10 runs
Benchmark 2: ./target/profiling/puffin-dev resolve-many --cache-dir
cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2>
/dev/null
Time (mean ± σ): 883.1 ms ± 23.3 ms [User: 14626.1 ms, System: 2542.2
ms]
Range (min … max): 849.5 ms … 916.9 ms 10 runs
Summary
'./target/profiling/puffin-dev resolve-many --cache-dir
cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2>
/dev/null' ran
1.10 ± 0.04 times faster than './target/profiling/puffin-dev-main
resolve-many --cache-dir cache-docker-no-build --no-build
pypi_top_8k_flat.txt --limit 1000 2> /dev/null'
I was moved to do this because I noticed `malloc`/`free` taking up a
fairly sizeable percentage of time during light profiling.
As is becoming a pattern, it will be easier to review this
commit-by-commit.
Ref #396 (wouldn't call this issue fixed)
-----
I did also try adding a `smallvec` optimization to the
`Version::release` field, but it didn't bare any fruit. I still think
there is more to explore since the results I observed don't quite line
up with what I expect. (So probably either my mental model is off or my
measurement process is flawed.) You can see that attempt with a little
more explanation here:
f9528b4ecd
In the course of adding the `smallvec` optimization, I also shrunk the
`Version` fields from a `usize` to a `u32`. They should at least be a
fixed size integer since version numbers aren't used to index memory,
and I shrunk it to `u32` since it seems reasonable to assume that all
version numbers will be smaller than `2^32`.
Before:
```
cargo run --bin puffin-dev -q -- resolve-cli "transformers[accelerate, agents, all, audio, codecarbon, deepspeed, deepspeed-testing, dev, dev-tensorflow, dev-torch, docs, docs_specific, flax, flax-speech, ftfy, integrations, ja, modelcreation, onnx, onnxruntime, optuna, quality, ray, retrieval, sagemaker, sentencepiece, serving, sigopt, sklearn, speech, testing, tf, tf-cpu, tf-speech, timm, tokenizers, torch, torch-speech, torch-vision, torchhub, video, vision]"
puffin-dev failed
Caused by: No solution found when resolving: transformers[accelerate,agents,all,audio,codecarbon,deepspeed,deepspeed-testing,dev,dev-tensorflow,dev-torch,docs,docs-specific,flax,flax-speech,ftfy,integrations,ja,modelcreation,onnx,onnxruntime,optuna,quality,ray,retrieval,sagemaker,sentencepiece,serving,sigopt,sklearn,speech,testing,tf,tf-cpu,tf-speech,timm,tokenizers,torch,torch-speech,torch-vision,torchhub,video,vision]
Caused by: Not a valid package or extra name: ".none". Names must start and end with a letter or digit and may only contain -, _, ., and alphanumeric characters
```
After:
```
cargo run --bin puffin-dev -q -- resolve-cli "transformers[accelerate, agents, all, audio, codecarbon, deepspeed, deepspeed-testing, dev, dev-tensorflow, dev-torch, docs, docs_specific, flax, flax-speech, ftfy, integrations, ja, modelcreation, onnx, onnxruntime, optuna, quality, ray, retrieval, sagemaker, sentencepiece, serving, sigopt, sklearn, speech, testing, tf, tf-cpu, tf-speech, timm, tokenizers, torch, torch-speech, torch-vision, torchhub, video, vision]"
puffin-dev failed
Caused by: No solution found when resolving: transformers[accelerate,agents,all,audio,codecarbon,deepspeed,deepspeed-testing,dev,dev-tensorflow,dev-torch,docs,docs-specific,flax,flax-speech,ftfy,integrations,ja,modelcreation,onnx,onnxruntime,optuna,quality,ray,retrieval,sagemaker,sentencepiece,serving,sigopt,sklearn,speech,testing,tf,tf-cpu,tf-speech,timm,tokenizers,torch,torch-speech,torch-vision,torchhub,video,vision]
Caused by: Couldn't parse metadata in fastapi-0.10.1-py3-none-any.whl (97ac91cb7cd2baab1a50b0c7a17d83/fastapi-0.10.1-py3-none-any.whl)
Caused by: Not a valid package or extra name: ".none". Names must start and end with a letter or digit and may only contain -, _, ., and alphanumeric characters
```
## Summary
It looks like, when you install `pip`, it includes a bunch of
`__pycache__` directories in the RECORD file (although these directories
don't exist until you run `pip`). Our uninstaller assumed that the
RECORD file only contained _files_.
Closes https://github.com/astral-sh/puffin/issues/389.
I intend this to become the main form of caching for puffin: You can
make http requests, you tranform the data to what you really need, you
have control over the cache key, and the cache is always json (or
anything else much faster we want to replace it with as long as it's
serde!)
## Summary
This PR refactors our `RemoteDistribution` type such that it now follows
a clear hierarchy that matches the actual variants, and encodes the
differences between source and built distributions:
```rust
pub enum Distribution {
Built(BuiltDistribution),
Source(SourceDistribution),
}
pub enum BuiltDistribution {
Registry(RegistryBuiltDistribution),
DirectUrl(DirectUrlBuiltDistribution),
}
pub enum SourceDistribution {
Registry(RegistrySourceDistribution),
DirectUrl(DirectUrlSourceDistribution),
Git(GitSourceDistribution),
}
/// A built distribution (wheel) that exists in a registry, like `PyPI`.
pub struct RegistryBuiltDistribution {
pub name: PackageName,
pub version: Version,
pub file: File,
}
/// A built distribution (wheel) that exists at an arbitrary URL.
pub struct DirectUrlBuiltDistribution {
pub name: PackageName,
pub url: Url,
}
/// A source distribution that exists in a registry, like `PyPI`.
pub struct RegistrySourceDistribution {
pub name: PackageName,
pub version: Version,
pub file: File,
}
/// A source distribution that exists at an arbitrary URL.
pub struct DirectUrlSourceDistribution {
pub name: PackageName,
pub url: Url,
}
/// A source distribution that exists in a Git repository.
pub struct GitSourceDistribution {
pub name: PackageName,
pub url: Url,
}
```
Most of the PR just stems downstream from this change. There are no
behavioral changes, so I'm largely relying on lint, tests, and the
compiler for correctness.
This PR tweaks the representation of `Tags` in order to offer a
faster implementation of `WheelFilename::is_compatible`. We now use a
nested map of tags that lets us avoid looping over every supported
platform tag. As the code comments suggest, that is the essential gain.
We still do not mind looping over the tags in each wheel name since they
tend to be quite small. And pushing our thumb on that side of things can
make things worse overall since it would likely slow down WheelFilename
construction itself.
For micro-benchmarks, we improve considerably for compatibility
checking:
$ critcmp base test3
group base test3
----- ---- -----
build_platform_tags/burntsushi-archlinux 1.00 46.2±0.28µs ? ?/sec 2.48
114.8±0.45µs ? ?/sec
wheelname_parsing/flyte-long-compatible 1.00 624.8±3.31ns 174.0 MB/sec
1.01 629.4±4.30ns 172.7 MB/sec
wheelname_parsing/flyte-long-incompatible 1.00 743.6±4.23ns 165.4 MB/sec
1.00 746.9±4.62ns 164.7 MB/sec
wheelname_parsing/flyte-short-compatible 1.00 526.7±4.76ns 54.3 MB/sec
1.01 530.2±5.81ns 54.0 MB/sec
wheelname_parsing/flyte-short-incompatible 1.00 540.4±4.93ns 60.0 MB/sec
1.01 545.7±5.31ns 59.4 MB/sec
wheelname_parsing_failure/flyte-long-extension 1.00 13.6±0.13ns 3.2
GB/sec 1.01 13.7±0.14ns 3.2 GB/sec
wheelname_parsing_failure/flyte-short-extension 1.00 14.0±0.20ns 1160.4
MB/sec 1.01 14.1±0.14ns 1146.5 MB/sec
wheelname_tag_compatibility/flyte-long-compatible 11.33 159.8±2.79ns
680.5 MB/sec 1.00 14.1±0.23ns 7.5 GB/sec
wheelname_tag_compatibility/flyte-long-incompatible 237.60
1671.8±37.99ns 73.6 MB/sec 1.00 7.0±0.08ns 17.1 GB/sec
wheelname_tag_compatibility/flyte-short-compatible 16.07 223.5±8.60ns
128.0 MB/sec 1.00 13.9±0.30ns 2.0 GB/sec
wheelname_tag_compatibility/flyte-short-incompatible 149.83 628.3±2.13ns
51.6 MB/sec 1.00 4.2±0.10ns 7.6 GB/sec
We do regress slightly on the time it takes for `Tags::new` to run, but
this is somewhat expected. And in absolute terms, 114us is perfectly
acceptable given that it's only executed ~once for each `puffin`
invocation.
Ad hoc benchmarks indicate an overall 25% perf improvement in `puffin
pip-compile` times. This roughly corresponds with how much time
`is_compatible` was taking. Indeed, profiling confirms that it has
virtually disappeared from the profile.
Fixes#157
## Summary
Low-priority but fun thing to end the day. You can now pass
`--target-version py37`, and we'll generate a resolution for Python 3.7.
See: https://github.com/astral-sh/puffin/issues/183.
One of the most common errors i observed are build failures due to
missing header files. On ubuntu, this generally means that you need to
install some `<...>-dev` package that the documentation tells you about,
e.g. [mysqlclient](https://github.com/PyMySQL/mysqlclient#linux) needs
`default-libmysqlclient-dev`, [some psycopg
versions](https://www.psycopg.org/psycopg3/docs/basic/install.html#local-installation)
(i remember that this was always required at some earlier point) require
`libpq-dev` and pygraphviz wants `graphviz-dev`. This is quite common
for many scientific packages (where conda has an advantage because they
can provide those package as a dependency).
The error message can be completely inscrutable if you're just a python
programmer (or user) and not a c programmer (example: pygraphviz):
```
warning: no files found matching '*.png' under directory 'doc'
warning: no files found matching '*.txt' under directory 'doc'
warning: no files found matching '*.css' under directory 'doc'
warning: no previously-included files matching '*~' found anywhere in distribution
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '.svn' found anywhere in distribution
no previously-included directories found matching 'doc/build'
pygraphviz/graphviz_wrap.c:3020:10: fatal error: graphviz/cgraph.h: No such file or directory
3020 | #include "graphviz/cgraph.h"
| ^~~~~~~~~~~~~~~~~~~
compilation terminated.
error: command '/usr/bin/gcc' failed with exit code 1
```
The only relevant part is `Fatal error: graphviz/cgraph.h: No such file
or directory`. Why is this file not there and how do i get it to be
there?
This is even harder to spot in pip's output, where it's 11 lines above
the last line:

I've special cased missing headers and made sure that the last line
tells you the important information: We're missing some header, please
check the documentation of {package} {version} for what to install:

Scrolling up:

The difference gets even clearer with a default ubuntu terminal with its
80 columns:

---
Note that the situation is better for a missing compiler, there i get:
```
[...]
warning: no previously-included files matching '*~' found anywhere in distribution
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '.svn' found anywhere in distribution
no previously-included directories found matching 'doc/build'
error: command 'gcc' failed: No such file or directory
---
```
Putting the last line into google, the first two results tell me to
`sudo apt-get install gcc`, the third even tells me about `sudo apt
install build-essential`
By default, we will build source distributions for both resolving and
installing, running arbitrary code. `--no-build` adds an option to ban
this and only install from wheels, no source distributions or git builds
allowed. We also don't fetch these and instead report immediately.
I've heard from users for whom this is a requirement, i'm implementing
it now because it's helpful for testing.
I'm thinking about adding a shared `PuffinSharedArgs` struct so we don't
have to repeat each option everywhere.
Closes https://github.com/astral-sh/puffin/issues/356.
The example from the issue now renders as:
```
❯ cargo run --bin puffin-dev -q -- resolve-cli tensorflow-cpu-aws
puffin-dev failed
Caused by: No solution found when resolving build dependencies for source distribution:
Caused by: Because there is no available version for tensorflow-cpu-aws and root depends on tensorflow-cpu-aws, version solving failed.
```
This was dumb of me. We pass out indexes when adding progress bars, but
were then removing entries on completion, so any outstanding indexes
were now _invalid_. We just shouldn't remove them. The `MultiProgress`
retains a reference anyway, IIUC.
Closes https://github.com/astral-sh/puffin/issues/360.
Rejigger Linux platform detection
This change makes some very small improvements to the Linux platform
detection logic. In particular, the existing logic did not work on my
Archlinux machine since /lib64/ld-linux-x86-64.so.2 isn't a symlink. In
that case, the detection logic should have fallen back to the slower
`ldd --version` technique, but `read_link` fails outright when its
argument isn't a symbolic link. So we tweak the logic to allow it to
fail, and if it does, we still try the `ldd --version` approach instead
of giving up completely.
I also made some cosmetic improvements to the regex matching, as well as
ensuring that the regexes are only compiled exactly once.
We now write the `direct_url.json` when installing, and _skip_
installing if we find a package installed via the direct URL that the
user is requesting.
A lot of TODOs, especially around cleaning up the `Source` abstraction
and its relationship to `DirectUrl`. I'm gonna keep working on these
today, but this works and makes the requirements clear.
Closes#332.
## Summary
This PR just adds the logic in `install-wheel-rs` to write
`direct_url.json`. We're not actually taking advantage of it yet (or
wiring it through) in Puffin.
Part of https://github.com/astral-sh/puffin/issues/332.
Ran `cargo upgrade --incompatible`, seems there are no changes required.
From cacache 0.12.0:
> BREAKING CHANGE: some signatures for copy have changed, and copy no
longer automatically reflinks
`which` 5.0.0 seems to have only error message changes.
## Summary
This is a first-pass at adding source distribution support to the
installer.
The previous installation flow was:
1. Come up with a plan.
1. Find a distribution (specific file) for every package that we'll need
to download.
1. Download those distributions.
1. Unzip them (since we assumed they were all wheels).
1. Install them into the virtual environment.
Now, Step (3) downloads both wheels and source distributions, and we
insert a step between Steps (3) and (4) to build any source
distributions into zipped wheels.
There are a bunch of TODOs, the most important (IMO) is that we
basically have two implementations of downloading and building, between
the stuff in `puffin_installer` and `puffin_resolver` (namely in
`crates/puffin-resolver/src/distribution`). I didn't attempt to clean
that up here -- it's already a problem, and it's related to the overall
problem we need to solve around unified caching and resource management.
Closes#243.
`PackageName` and `ExtraName` can now only be constructed from valid
names. They share the same rules, so i gave them the same
implementation. Constructors are split between `new` (owned) and
`from_str` (borrowed), with the owned version avoiding allocations.
Closes#279
---------
Co-authored-by: Zanie <contact@zanie.dev>
## Summary
This just enables the `DistributionFinder` (previously known as the
`WheelFinder`) to select source distributions when there are no matching
wheels for a given platform. As a reminder, the `DistributionFinder` is
a simple resolver that doesn't look at any dependencies: it just takes a
set of pinned packages, and finds a distribution to install to satisfy
each requirement.
In the resolver, our current model for solving URL dependencies requires
that we visit the URL dependency _before_ the registry-based dependency.
This PR encodes a strict requirement that all URL dependencies be
declared upfront, either as requirements or constraints.
I wrote more about how it works and why it's necessary in documentation
[here](https://github.com/astral-sh/puffin/pull/319/files#diff-2b1c4f36af0c62a2b7bebeae9473ae083588f2a6b18a3ec52393a24266adecbbR20).
I think we could relax this constraint over time, but it requires a more
sophisticated model -- and for now, I just want something that's (1)
correct, (2) easy for us to reason about, and (3) easy for users to
reason about.
As additional motivation... allowing arbitrary URL dependencies anywhere
in the tree creates some really confusing situations in which I'm not
even sure what the right answers are. For example, assume you declare a
direct dependency on `Werkzeug==2.0.0`. You then depend on a version of
Flask that depends on a version of `Werkzeug` from some arbitrary URL.
You build the source distribution at that arbitrary URL, and it turns
out it _does_ build to a declared version of 2.0.0. What should happen?
(And if it resolves to a version that _isn't_ 2.0.0, what should happen
_then_?) I suspect different tools handle this differently, but it must
lead to a lot of "silent" failures. In my testing of Poetry, it seems
like Poetry just ignores the URL dependency, which seems wrong, but is
also a behavior we could implement in the future.
Closes https://github.com/astral-sh/puffin/issues/303.
Closes https://github.com/astral-sh/puffin/issues/284.
Ensures that if we need to access the same Git repo twice in a
resolution, we only have one handler to that repo at a time. (Otherwise,
`git2` panics.)
This PR adds a mechanism by which we can ensure that we _always_ try to
refresh Git dependencies when resolving; further, we now write the fully
resolved SHA to the "lockfile". However, nothing in the code _assumes_
we do this, so the installer will remain agnostic to this behavior.
The specific approach taken here is minimally invasive. Specifically,
when we try to fetch a source distribution, we check if it's a Git
dependency; if it is, we fetch, and return the exact SHA, which we then
map back to a new URL. In the resolver, we keep track of URL
"redirects", and then we use the redirect (1) for the actual source
distribution building, and (2) when writing back out to the lockfile. As
such, none of the types outside of the resolver change at all, since
we're just mapping `RemoteDistribution` to `RemoteDistribution`, but
swapping out the internal URLs.
There are some inefficiencies here since, e.g., we do the Git fetch,
send back the "precise" URL, then a moment later, do a Git checkout of
that URL (which will be _mostly_ a no-op -- since we have a full SHA, we
don't have to fetch anything, but we _do_ check back on disk to see if
the SHA is still checked out). A more efficient approach would be to
return the path to the checked-out revision when we do this conversion
to a "precise" URL, since we'd then only interact with the Git repo
exactly once. But this runs the risk that the checked-out SHA changes
between the time we make the "precise" URL and the time we build the
source distribution.
Closes#286.
Extends #295Closes#214
Copies some of the implementations from `pubgrub::report` so we can
implement Puffin `PubGrubPackage` specific display when explaining
failed resolutions.
Here, we just drop the dummy version number if it's a
`PubGrubPackage::Root` package. In the future, we can further customize
reporting.
Part of https://github.com/astral-sh/puffin/issues/214
Adds a `project: Option<PackageName>` to the `Manifest`, `Resolver`, and
`RequirementsSpecification`.
To populate an optional `name` for `PubGubPackage::Root`.
I'll work on removing the version number next.
Should we consider using the parent directory name when a
`pyproject.toml` file is not present?
From
https://packaging.python.org/en/latest/specifications/recording-installed-packages/#recording-installed-packages
> This directory is named as {name}-{version}.dist-info, with name and
version fields corresponding to Core metadata specifications. Both
fields must be normalized (see Package name normalization and PEP 440
for the definition of normalization for each field respectively), and
replace dash (-) characters with underscore (_) characters, so the
.dist-info directory always has exactly one dash (-) character in its
stem, separating the name and version fields.
Follow up to #278
This PR makes the cache non-optional in most of Puffin, which simplifies
the code, allows us to reuse the cache within a single command (even
with `--no-cache`), and also allows us to use the cache for disk storage
across an invocation.
I left the cache as optional for the `Virtualenv` and `InterpreterInfo`
abstractions, since those are generic enough that it seems nice to have
a non-cached version, but it's kind of arbitrary.
## Summary
This PR adds support for Git dependencies, like:
```
flask @ git+https://github.com/pallets/flask.git
```
Right now, they're only supported in the resolver (and not the
installer), since the installer doesn't yet support source distributions
at all.
The general approach here is based on Cargo's Git implementation.
Specifically, I adapted Cargo's
[`git`](23eb492cf9/src/cargo/sources/git/mod.rs)
module to perform the cloning, which is based on `libgit2`.
As compared to Cargo's implementation, I made the following changes:
- Removed any unnecessary code.
- Fixed any Clippy errors for our stricter ruleset.
- Removed the dependency on `curl`, in favor of `reqwest` which we use
elsewhere.
- Removed the ability to use `gix`. Cargo allows the use of `gix` as an
experimental flag, but it only supports a small subset of the
operations. When Cargo fully adopts `gix`, we should plan to do the
same.
- Removed Cargo's host key checking. We need to re-add this! I'll do it
shortly.
- Removed Cargo's progress bars. We should re-add this too, but we use
`indicatif` and Cargo had their own thing.
There are a few follow-ups to consider:
- Adding support in the installer.
- When we lock, we should write out the Git URL that includes the exact
SHA. This lets us cache in perpetuity and avoids dependencies changing
without re-locking.
- When we resolve, we should _always_ try to refresh Git dependencies.
(Right now, we skip if the wheel was already built.)
I'll work on the latter two in follow-up PRs.
Closes#202.
The normalized name abstractions were not consistently, this PR uses
them where they were previously missing:
* `WheelFilename::distribution`
* `Requirement::name`
* `Requirement::extras`
* `Metadata21::name`
* `Metadata21::provides_dist`
With `puffin-package` depending on `pep508_rs` this would be cyclical
crate dependency, so `puffin-normalize` gets split out from
`puffin-package`.
`DistInfoName` has the same task and semantics as `PackageName`, so it's
merged into the latter.
`PackageName` and `ExtraName` documentation is moved onto the type and
their constructors are called `new` instead of `normalize`. We now use
these constructors rarely enough the implicit allocation by
`to_string()` shouldn't matter anymore, while more actual cloning
becomes visible.
This docker container provides isolation of source distribution builds,
whether [intended to be
helpful](https://pypi.org/project/nvidia-pyindex/) or other more or less
malicious forms of host system modification.
Fixes#194
---------
Co-authored-by: Zanie Blue <contact@zanie.dev>
Hard linking might not be supported but we (afaik) can't detect this
ahead of time, so we'll try hard linking the first file, if this
succeeds we'll know later hard linking errors are not due to lack of
os/fs support, if it fails we'll switch to copying for the rest of the
install. Follow up to
https://github.com/astral-sh/puffin/pull/237#discussion_r1376705137
Show the resolution in a concise format for puffin-dev. Note that this
doesn't affect the main puffin output, it's just more convenient for me
when developing.
Extends #254
Adds validation of extra names provided by users in `pip-compile` e.g.
```
error: invalid value 'foo!' for '--extra <EXTRA>': Extra names must start and end with a
letter or digit and may only contain -, _, ., and alphanumeric characters
```
We'll want to add something similar to `PackageName`. I'd be curious to
improve the AP, making the unvalidated nature of `::normalize` clear?
Perhaps worth pursuing later though as I don't have a better idea.
## Summary
This PR adds support for resolving and installing dependencies via
direct URLs, like:
```
werkzeug @ 960bb4017c4aed12b5ed8b78e0153e/Werkzeug-2.0.0-py3-none-any.whl
```
These are fairly common (e.g., with `torch`), but you most often see
them as Git dependencies.
Broadly, structs like `RemoteDistribution` and friends are now enums
that can represent either registry-based dependencies or URL-based
dependencies:
```rust
/// A built distribution (wheel) that exists as a remote file (e.g., on `PyPI`).
#[derive(Debug, Clone)]
#[allow(clippy::large_enum_variant)]
pub enum RemoteDistribution {
/// The distribution exists in a registry, like `PyPI`.
Registry(PackageName, Version, File),
/// The distribution exists at an arbitrary URL.
Url(PackageName, Url),
}
```
In the resolver, we now allow packages to take on an extra, optional
`Url` field:
```rust
#[derive(Debug, Clone, Eq, Derivative)]
#[derivative(PartialEq, Hash)]
pub enum PubGrubPackage {
Root,
Package(
PackageName,
Option<DistInfoName>,
#[derivative(PartialEq = "ignore")]
#[derivative(PartialOrd = "ignore")]
#[derivative(Hash = "ignore")]
Option<Url>,
),
}
```
However, for the purpose of version satisfaction, we ignore the URL.
This allows for the URL dependency to satisfy the transitive request in
cases like:
```
flask==3.0.0
werkzeug @ 254c3e9b5f5941e900b71206e6313b/werkzeug-3.0.1-py3-none-any.whl
```
There are a couple limitations in the current approach:
- The caching for remote URLs is done separately in the resolver vs. the
installer. I decided not to sweat this too much... We need to figure out
caching holistically.
- We don't support any sort of time-based cache for remote URLs -- they
just exist forever. This will be a problem for URL dependencies, where
we need some way to evict and refresh them. But I've deferred it for
now.
- I think I need to redo how this is modeled in the resolver, because
right now, we don't detect a variety of invalid cases, e.g., providing
two different URLs for a dependency, asking for a URL dependency and a
_different version_ of the same dependency in the list of first-party
dependencies, etc.
- (We don't yet support VCS dependencies.)
Tests would sometimes flake with this locally e.g. "1.50s" was not
filtered correctly.
Verified with
```diff
diff --git a/crates/puffin-cli/src/commands/pip_compile.rs b/crates/puffin-cli/src/commands/pip_compile.rs
index 0193216..2d6f8af 100644
--- a/crates/puffin-cli/src/commands/pip_compile.rs
+++ b/crates/puffin-cli/src/commands/pip_compile.rs
@@ -150,6 +150,8 @@ pub(crate) async fn pip_compile(
result => result,
}?;
+ std:🧵:sleep(std::time::Duration::from_secs(1));
+
let s = if resolution.len() == 1 { "" } else { "s" };
writeln!(
printer,
```
This also allows us to get rid of `PinnedPackage` _and_ to remove some
`Result<...>` types due to needless conversions between
otherwise-identical types.
Extends #253Closes#241
Adds `extras` to `RequirementsSpecification` to track extras used to
construct the requirements so we can throw an error when not all of the
requested extras are used.
Going to add some tests.
Extends #239Closes#245
Normalizes optional dependency group names found in pyproject files
before comparing them to the normalized user-requested extras.
Adds support for `pip-compile --extra <name> ...` which includes
optional dependencies in the specified group in the resolution.
Following precedent in `pip-compile`, if a given extra is not found,
there is no error. ~We could consider warning in this case.~ We should
probably add an error but it expands scope and will be considered
separately in #241
There are packages such as DTLSSocket 0.1.16 that say
```toml
[build-system]
requires = ["Cython<3", "setuptools", "wheel"]
```
In this case we need to install requires PEP 517 style but then call setup.py in the
legacy way
Part of making home-assistant work
musl (which we already use in ruff) allows statically linked binaries on
linux. This PR switches to rustls and vendors and fixes the glibc
detection. Using static musl builds makes it easier to avoid glibc
errors in docker and we'll need it later for alpine users anyway.
An alternative is using vendored openssl.
I didn't realize this, but they made a bunch of improvements to how
PubGrub represents versions which lets us greatly simplify our own
PubGrub version wrapper
(https://github.com/pubgrub-rs/guide/pull/6/files).
We now accept a pre-release if (1) all versions are pre-releases, or (2)
there was a pre-release marker in the dependency specifiers for a direct
dependency.
The code is written such that we can support a variety of pre-release
strategies.
Closes https://github.com/astral-sh/puffin/issues/191.
To check to top 1k (current state):
```bash
scripts/resolve/get_pypi_top_8k.sh
cargo run --bin puffin-dev -- resolve-many scripts/resolve/pypi_top_8k_flat.txt --limit 1000
```
Results:
```
Errors: pywin32, geoip2, maxminddb, pypika, dirac
Success: 995, Error: 5
```
pywin32 has no solution for the build environment, 3 have no
`[build-system]` entry in pyproject.toml, `dirac` is missing cmake
Currently, this is only the source distribution building feature moved.
It's intended that we can add development and test commands there
without affecting the main cli surface
Select a compatible wheel for a version, even we already found a source
distribution previously.
If no wheel is found, select the most recent source distribution, not
the oldest compatible one.
This fixes the resolution of `mst.in`, which i added
Like `pip-compile`, we now respect existing versions from the
`requirements.txt` provided via `--output-file`, unless you pass a
`--upgrade` flag.
Closes#166.
Modifies the resolver to remove any incompatible distributions upfront,
and store them in an index by version. This will be necessary to support
`--upgrade` semantics.
This actually does cause a meaningful slowdown right now (since we now
iterate over all files, even if we otherwise never would've needed to
touch them), but we should be able to optimize it out later.
Everywhere else, we use cache to refer to a filesystem cache, so this is
kind of confusing. It's really an in-memory index that we build up over
the course of the solve.
Previously, we had two python interpreter metadata structs, one in
gourgeist and one in puffin. Both would spawn a subprocess to query
overlapping metadata and both would appear in the cli crate, if you
weren't careful you could even have to different base interpreters at
once. This change unifies this to one set of metadata, queried and
cached once.
Another effect of this crate is proper separation of python interpreter
and venv. A base interpreter (such as `/usr/bin/python/`, but also pyenv
and conda installed python) has a set of metadata. A venv has a root and
inherits the base python metadata except for `sys.prefix`, which unlike
`sys.base_prefix`, gets set to the venv root. From the root and the
interpreter info we can compute the paths inside the venv. We can reuse
the interpreter info of the base interpreter when creating a venv
without having to query the newly created `python`.
This is isn't ready, but it can resolve
`meine_stadt_transparent==0.2.14`.
The source distributions are currently being built serially one after
the other, i don't know if that is incidentally due to the resolution
order, because sdist building is blocking or because of something in the
resolver that could be improved.
It's a bit annoying that the thing that was supposed to do http requests
now suddenly also has to a whole download/unpack/resolve/install/build
routine, it messes up the type hierarchy. The much bigger problem though
is avoid recursive crate dependencies, it's the reason for the callback
and for splitting the builder into two crates (badly named atm)
As elsewhere, we just use the `pip` and `pip-compile` APIs. So we
support `--index-url` to override PyPI, then `--extra-index-url` to add
_additional_ indexes, and `--no-index` to avoid hitting the index at
all.
Closes#156.
Allows the user to select between clone, hardlink, and copy semantics
for installs. (The pnpm documentation has a decent description of what
these mean: https://pnpm.io/npmrc#package-import-method.)
Closes#159.
Borrows terminology from pnpm by introducing three resolution modes:
- "Highest": always choose the highest compliant version (default).
- "Lowest": always choose the lowest compliant version.
- "LowestDirect": choose the lowest compliant version of direct
dependencies, and the highest compliant version of any transitive
dependencies. (This makes a bit more sense than "lowest".)
Closes https://github.com/astral-sh/puffin/issues/142.
The need for this became clear when working on the source distribution
integration into the resolver.
While at it i also switch the `WheelFilename` version to the parsed
`pep440_rs` version now that we have this crate.
Builds up a complete resolved graph from PubGrub, and shows the sources
that led to each package being included in the resolution, like
`pip-compile`.
Closes https://github.com/astral-sh/puffin/issues/60.
Kind of an oversight in my initial implementation. If we find that any
package has _no_ matching versions, we should select it! This lets us
short-circuit _immediately_ when top-level dependencies aren't
satisfiable.
Updates to `29c48fb9f3daa11bd02794edd55060d0b01ee705` from the
`pubgrub-rs` dev branch. This lets us reduce the number of changes we've
made to PubGrub itself (now, only changing visibility to export a few
things from the `solver.rs` module).
Gets rid of the custom `DistInfo` struct in the site-packages
abstraction in favor of a new kind of distribution
(`InstalledDistribution`). No change in behavior.
This is also a lot faster. Unfortunately it copies a lot of code from
the sync cli since the `Printer` is private.
The first commit are some refactorings i made when i thought about how i
could reuse the existing code.
Support calling `prepare_metadata_for_build_wheel`, which can give you
the metadata without executing the actual build if the backend supports
it.
This makes the code a lot uglier since we effectively have a state
machine:
* Setup: Either venv plus requires (PEP 517) or just a venv (setup.py)
* Get metadata (optional step): None (setup.py) or
`prepare_metadata_for_build_wheel` and saving that result
* Build: `setup.py`, `build_wheel()` or
`build_wheel(metadata_directory=metadata_directory)`, but i think i got
general flow right.
@charliermarsh This is a "barely works but unblocks building on top"
implementation, say if you want more polishing (i'll look at this again
tomorrow)
This needs far better error handling and user-facing feedback, but it
does the basic operation (and includes discovery of the `pyproject.toml`
file, etc.).
This PR enables us to make "fixups" to bad metadata. I copied over the
one fixup that @konstin made in `monotrail-resolve`, and added a few
common ones for `Requires-Python`.
This PR reverts #109 which is actually a performance _regression_ since
we need to iterate over a bunch of wheels that we could otherwise
entirely ignore.
Rather than constantly iterating over all files and testing their
compatibility with the current platform, just store wheels we can
actually consider in the solver cache.
This adds a basic sdist builder that has been tested with two source
distributions, one with a PEP 517 backend and one with setup.py.
It uses pip for requirements installation atm, lacks testing in all
directions, lacks checks for recursive requirements, can't pass in
already resolved versions, doesn't support prepare metadata for build to
allow resolution to continue without doing the actual (native) build,
error messages are mediocre, etc.
```console
$ RUST_LOG=puffin_build=debug puffin-build --wheels wheels downloads/tqdm-4.66.1.tar.gz
2023-10-16T12:28:35.503182Z DEBUG build_sdist{path="downloads/tqdm-4.66.1.tar.gz" base_python="/usr/bin/python3"}: puffin_build: Building downloads/tqdm-4.66.1.tar.gz
2023-10-16T12:28:35.521780Z INFO build_sdist{path="downloads/tqdm-4.66.1.tar.gz" base_python="/usr/bin/python3"}:extract_archive: puffin_build: close time.busy=18.4ms time.idle=16.7µs
2023-10-16T12:28:35.845096Z DEBUG build_sdist{path="downloads/tqdm-4.66.1.tar.gz" base_python="/usr/bin/python3"}:resolve_and_install: puffin_build: Calling pip to install build dependencies
2023-10-16T12:28:37.668660Z INFO build_sdist{path="downloads/tqdm-4.66.1.tar.gz" base_python="/usr/bin/python3"}:resolve_and_install: puffin_build: close time.busy=1.82s time.idle=13.2µs
2023-10-16T12:28:37.668744Z DEBUG build_sdist{path="downloads/tqdm-4.66.1.tar.gz" base_python="/usr/bin/python3"}: puffin_build: Calling `setuptools.build_meta.get_requires_for_build_wheel()`
2023-10-16T12:28:38.159205Z INFO build_sdist{path="downloads/tqdm-4.66.1.tar.gz" base_python="/usr/bin/python3"}:run_python_script{python_interpreter="/tmp/.tmpm4cTra/venv/bin/python"}: puffin_build: close time.busy=490ms time.idle=13.0µs
2023-10-16T12:28:38.159304Z DEBUG build_sdist{path="downloads/tqdm-4.66.1.tar.gz" base_python="/usr/bin/python3"}: puffin_build: Calling `setuptools.build_meta.build_wheel()`
2023-10-16T12:28:38.501732Z INFO build_sdist{path="downloads/tqdm-4.66.1.tar.gz" base_python="/usr/bin/python3"}:run_python_script{python_interpreter="/tmp/.tmpm4cTra/venv/bin/python"}: puffin_build: close time.busy=342ms time.idle=15.2µs
2023-10-16T12:28:38.522700Z INFO build_sdist{path="downloads/tqdm-4.66.1.tar.gz" base_python="/usr/bin/python3"}: puffin_build: close time.busy=3.02s time.idle=16.2µs
Wheel built to /home/konsti/projects/puffin/crates/puffin-build/wheels/tqdm-4.66.1-py3-none-any.whl
2023-10-16T12:28:38.522772Z DEBUG puffin_build: Took 3020ms
$ puffin-build --wheels wheels downloads/geoextract-0.3.1.tar.gz
2023-10-16T12:28:40.884622Z DEBUG build_sdist{path="downloads/geoextract-0.3.1.tar.gz" base_python="/usr/bin/python3"}: puffin_build: Building downloads/geoextract-0.3.1.tar.gz
2023-10-16T12:28:40.887743Z INFO build_sdist{path="downloads/geoextract-0.3.1.tar.gz" base_python="/usr/bin/python3"}:extract_archive: puffin_build: close time.busy=2.97ms time.idle=12.6µs
2023-10-16T12:28:41.469738Z INFO build_sdist{path="downloads/geoextract-0.3.1.tar.gz" base_python="/usr/bin/python3"}: puffin_build: close time.busy=585ms time.idle=15.3µs
Wheel built to /home/konsti/projects/puffin/crates/puffin-build/wheels/geoextract-0.3.1-py3-none-any.whl
2023-10-16T12:28:41.469814Z DEBUG puffin_build: Took 585ms
```
The main change is to print the whole error chain. We can combine this
with adding `.context` to distinct phases to be able to locate crashes
without having to use a debugger.
## Summary
This PR enables the proof-of-concept resolver to backtrack by way of
using the `pubgrub-rs` crate.
Rather than using PubGrub as a _framework_ (implementing the
`DependencyProvider` trait, letting PubGrub call us), I've instead
copied over PubGrub's primary solver hook (which is only ~100 lines or
so) and modified it for our purposes (e.g., made it async).
There's a lot to improve here, but it's a start that will let us
understand PubGrub's appropriateness for this problem space. A few
observations:
- In simple cases, the resolver is slower than our current (naive)
resolver. I think it's just that the pipelining isn't as efficient as in
the naive case, where we can just stream package and version fetches
concurrently without any bottlenecks.
- A lot of the code here relates to bridging PubGrub with our own
abstractions -- so we need a `PubGrubPackage`, a `PubGrubVersion`, etc.
This fixes two bugs on linux:
`/tmp` and `$HOME` are technically on two different partitions on my
machine, which means that rename-as-atomic-dir-write doesn't work. The
solution is to create the temp dir in the target directory.
zip files may contain directory entries, we can't create files for them
but need to create directories. We could skip them though because iirc
they are not in the RECORD so they won't be uninstalled.
We can always restore these from history, but right now, it feels a lot
more productive to just hit PyPI directly for our integration tests,
since we don't have to spend time figuring out mocks.
Remove the parser I wrote in favor of Konsti's which is much more
complete. The only change vs. the version in `poc-monotrail` is that I
changed the tests to use insta rather than manually storing and
comparing against JSON snapshots.
Closes https://github.com/astral-sh/puffin/issues/89.
It looks like using _either_ async Rust with a `JoinSet` _or_
parallelizing a fixed threadpool with Rayon provide about a ~5% speed-up
over our current serial approach:
```console
❯ hyperfine --runs 30 --warmup 5 --prepare "./target/release/puffin venv .venv" \
"./target/release/rayon sync ./scripts/benchmarks/requirements-large.txt" \
"./target/release/async sync ./scripts/benchmarks/requirements-large.txt" \
"./target/release/main sync ./scripts/benchmarks/requirements-large.txt"
Benchmark 1: ./target/release/rayon sync ./scripts/benchmarks/requirements-large.txt
Time (mean ± σ): 295.7 ms ± 16.9 ms [User: 28.6 ms, System: 263.3 ms]
Range (min … max): 249.2 ms … 315.9 ms 30 runs
Benchmark 2: ./target/release/async sync ./scripts/benchmarks/requirements-large.txt
Time (mean ± σ): 296.2 ms ± 20.2 ms [User: 36.1 ms, System: 340.1 ms]
Range (min … max): 258.0 ms … 359.4 ms 30 runs
Benchmark 3: ./target/release/main sync ./scripts/benchmarks/requirements-large.txt
Time (mean ± σ): 306.6 ms ± 19.5 ms [User: 25.3 ms, System: 220.5 ms]
Range (min … max): 269.6 ms … 332.2 ms 30 runs
Summary
'./target/release/rayon sync ./scripts/benchmarks/requirements-large.txt' ran
1.00 ± 0.09 times faster than './target/release/async sync ./scripts/benchmarks/requirements-large.txt'
1.04 ± 0.09 times faster than './target/release/main sync ./scripts/benchmarks/requirements-large.txt'
```
It's much easier to just parallelize with Rayon and avoid async in the
underlying wheel code, so this PR takes that approach for now.
Mocks out the PyPI client using some checked-in fixtures. The test is
very basic, and I'm not very happy with all the ceremony around the
mocks and such, but it's an interesting experiment at least.
When we go to install a locked `requirements.txt`, if a wheel is already
available in the local cache, and matches the version specifiers, we can
just use it directly without fetching the package metadata. This speeds
up the no-op case by about 33%.
Closes https://github.com/astral-sh/puffin/issues/48.
I think this isn't necessary to support in this generic crate. If we
choose to adopt Monotrail-style concepts, we'll likely need to rework
them anyway.
It turns out that on macOS, you can pass `clonefile` a directory to
recursively copy an entire directory. This speeds up wheel installation
dramatically, by about 3x.
The setup is now as follows:
- All user-facing logging goes through `tracing` at an `info` leve.
(This excludes messages that go to `stdout`, like the compiled
`requirements.txt` file.)
- We have `--quiet` and `--verbose` command-line flags to set the
tracing filter and format defaults. So if you use `--verbose`, we
include timestamps and targets, and filter at `puffin=debug` level.
- However, we always respect `RUST_LOG`. So you can override the
_filter_ via `RUST_LOG`.
For example: the standard setup filters to `puffin=info`, and doesn't
show timestamps or targets:
<img width="1235" alt="Screen Shot 2023-10-08 at 3 41 22 PM"
src="https://github.com/astral-sh/puffin/assets/1309177/54ca4db6-c66a-439e-bfa3-b86dee136e45">
If you run with `--verbose`, you get debug logging, but confined to our
crates:
<img width="1235" alt="Screen Shot 2023-10-08 at 3 41 57 PM"
src="https://github.com/astral-sh/puffin/assets/1309177/c5c1af11-7f7a-4038-a173-d9eca4c3630b">
If you want verbose logging with _all_ crates, you can add
`RUST_LOG=debug`:
<img width="1235" alt="Screen Shot 2023-10-08 at 3 42 39 PM"
src="https://github.com/astral-sh/puffin/assets/1309177/0b5191f4-4db0-4db9-86ba-6f9fa521bcb6">
I think this is a reasonable setup, though we can see how it feels and
refine over time.
Closes https://github.com/astral-sh/puffin/issues/57.
This PR gets `gourgeist` passing our local CI and integrated into the
broader workspace.
There's some duplicate between concepts in `gourgeist` (like the
`InterpreterInfo`) and structs we have elsewhere, but we can tackle
those later.
This PR copies over the `gourgeist` crate at commit
`e64c17a263dac6933702dc8d155425c053fe885a` with no modifications.
It won't pass CI, but modifications will intentionally be confined to
later PRs.
This PR massively speeds up the case in which you need to install wheels
that already exist in the global cache.
The new strategy is as follows:
- Download the wheel into the content-addressed cache.
- Unzip the wheel into the cache, but ignore content-addressing. It
turns out that writing to `cacache` for every file in the zip added a
ton of overhead, and I don't see any actual advantages to doing so.
Instead, we just unzip the contents into a directory at, e.g.,
`~/.cache/puffin/django-4.1.5`.
- (The unzip itself is now parallelized with Rayon.)
- When installing the wheel, we now support unzipping from a directory
instead of a zip archive. This required duplicating and tweaking a few
functions.
- When installing the wheel, we now use reflinks (or copy-on-write
links). These have a few fantastic properties: (1) they're extremely
cheap to create (on macOS, they are allegedly faster than hard links);
(2) they minimize disk space, since we avoid copying files entirely in
the vast majority of cases; and (3) if the user then edits a file
locally, the cache doesn't get polluted. Orogene, Bun, and soon pnpm all
use reflinks.
Puffin is now ~15x faster than `pip` for the common case of installing
cached data into a fresh environment.
Closes https://github.com/astral-sh/puffin/issues/21.
Closes https://github.com/astral-sh/puffin/issues/39.
This PR modifies the `install-wheel-rs` (and a few other crates) to get
everything playing nicely. Specifically, CI should pass, and all these
crates now use workspace dependencies between one another.
As part of this change, I split out the wheel name parsing into its own
`wheel-filename` crate, and the compatibility tag parsing into its own
`platform-tags` crate.
This PR copies over the `install-wheel-rs` crate at commit
`10730ea1a84c58af6b35fb74c89ed0578ab042b6` with no modifications.
It won't pass CI, but modifications will intentionally be confined to
later PRs.
This PR modifies the PEP 440 and PEP 508 crates to pass CI, primarily by
fixing all lint violations.
We're also now using these crates in the workspace via `path`.
(Previously, we were still fetching them from Cargo.)
This PR copies over the `pep440-rs` crate at commit
`82aa5d4dcbe676b121dc931b0afa09a82de8e3d7` with no modifications.
It won't pass CI, but modifications will intentionally be confined to
later PRs.
This PR copies over the `pep440-rs` crate at commit
`a8303b01ffef6fccfdce562a887f6b110d482ef3` with no modifications.
It won't pass CI, but modifications will intentionally be confined to
later PRs.