Commit Graph

34 Commits

Author SHA1 Message Date
64a2960751 Align paths in status messages and progress bars
Also print "Ignored" when paths are ignored due to transforms
2021-05-31 12:32:42 +02:00
adb5d4ade3 Print files that are *not* deleted by cleanup
These are files that are not present on the remote source any more, but still
present locally. They also show up in the report.
2021-05-26 10:58:19 +02:00
07a75a37c3 Fix FileNotFoundError on Windows 2021-05-25 15:57:03 +00:00
980578d05a Avoid downloading in some cases
Depending on how on_conflict is set, we can determine a few situations where
downloading is never necessary.
2021-05-25 15:20:30 +02:00
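The short-circuit described in 980578d05a can be pictured with the sketch below. It is not PFERD's code: the `OnConflict` members and the `download_needed` helper are hypothetical names chosen for illustration, and only two policies are modelled.

```python
from enum import Enum


class OnConflict(Enum):
    # Hypothetical policy names, used for illustration only.
    LOCAL_FIRST = "local-first"    # the local file wins every conflict
    REMOTE_FIRST = "remote-first"  # the remote file wins every conflict


def download_needed(local_exists: bool, on_conflict: OnConflict) -> bool:
    """Return False when the downloaded data could never be used."""
    if not local_exists:
        # Nothing to conflict with, so the file has to be fetched.
        return True
    if on_conflict is OnConflict.LOCAL_FIRST:
        # The local file is kept no matter what, so skip the download.
        return False
    # Any other policy may need the remote content to decide.
    return True
```

Under these assumptions, an existing local file combined with a policy that always keeps local files is one situation where the request can be skipped entirely.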
eb8b915813 Fix path prefix on Windows
Previously, the path prefix was only set if "windows_paths" was true, regardless
of OS. Now the path prefix is always set on Windows and never set on other OSes.
2021-05-25 14:23:38 +02:00
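The rule in eb8b915813 (decide by operating system, not by a config flag) might look roughly like the sketch below. It assumes the prefix in question is the extended-length `\\?\` prefix, which the commit message does not state; the helper name `with_windows_prefix` and its `os_name` parameter are hypothetical.

```python
import os


def with_windows_prefix(path: str, os_name: str = os.name) -> str:
    """Prepend the extended-length prefix on Windows, and only there.

    Assumption: `path` is already absolute; the prefix is not meaningful
    for relative paths.
    """
    prefix = "\\\\?\\"  # the four characters \\?\
    if os_name == "nt" and not path.startswith(prefix):
        return prefix + path
    return path
```

The `os_name` parameter exists only so the behaviour can be exercised on any platform; by default the decision follows `os.name`.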
bce3dc384d Deduplicate path names in crawler
Also rename files so they follow the restrictions for Windows file names if
we're on Windows.
2021-05-25 12:11:15 +02:00
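A minimal sketch of the two steps named in bce3dc384d, deduplicating names and making them acceptable to Windows. The function names and the numeric-suffix scheme are assumptions for illustration and are not taken from PFERD; reserved device names such as CON are not handled here.

```python
from typing import Set


def deduplicate(name: str, seen: Set[str]) -> str:
    """Return a name not yet in `seen`, adding a numeric suffix if needed."""
    candidate = name
    counter = 1
    while candidate in seen:
        candidate = f"{name}_{counter}"
        counter += 1
    seen.add(candidate)
    return candidate


def windows_safe(name: str) -> str:
    """Replace characters that Windows forbids in file names."""
    forbidden = '<>:"/\\|?*'
    cleaned = "".join(
        "_" if c in forbidden or ord(c) < 32 else c for c in name
    )
    # Windows also rejects names that end in a space or a dot.
    return cleaned.rstrip(" .") or "_"
```

For example, two sibling entries both called `slides.pdf` would come out as `slides.pdf` and `slides.pdf_1` under this scheme.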
27b5a8e490 Rename log.action to log.status 2021-05-23 22:40:33 +02:00
ce1dbda5b4 Overhaul colours
"Crawled" and "Downloaded" are now printed less bright than "Crawling" and
"Downloading" as they're not as important. Explain topics are printed in yellow
to stand out a bit more from the cyan action messages.
2021-05-23 21:33:04 +02:00
6ca0ecdf05 Load and store reports 2021-05-23 20:46:29 +02:00
5edd868d5b Fix always-smart redownloading the wrong files 2021-05-23 18:49:34 +02:00
74c7b39dc8 Clean up files in alphabetical order 2021-05-23 18:39:25 +02:00
445dffc987 Reword some explanations 2021-05-23 18:35:32 +02:00
c0cecf8363 Log crawl and download actions more extensively 2021-05-23 16:25:44 +02:00
b998339002 Fix cleanup logging of paths 2021-05-23 16:25:44 +02:00
245c9c3dcc Explain output dir decisions and steps 2021-05-23 16:25:44 +02:00
803e5628a2 Clean up logging
Paths are now (hopefully) logged consistently across all crawlers
2021-05-23 11:37:19 +02:00
ec3767c545 Create crawler base dir at start of crawl 2021-05-23 10:52:02 +02:00
44ecb2fbe7 Fix cleanup deleting crawler's base directory 2021-05-23 10:45:37 +02:00
ec95dda18f Unify crawling and downloading steps
Now, the progress bar, limiter etc. for downloading and crawling are all handled
via the reusable CrawlToken and DownloadToken context managers.
2021-05-22 21:36:53 +02:00
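The idea behind ec95dda18f, bundling the limiter and the progress reporting into one reusable context manager, can be sketched as follows. This is an illustration, not the actual CrawlToken: the `crawl_token` function, the semaphore-based limiter and the list used as a log are all assumptions.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List


@asynccontextmanager
async def crawl_token(
    limiter: asyncio.Semaphore, log: List[str], path: str
) -> AsyncIterator[None]:
    # Wait for a free slot, then report the start of the step.
    async with limiter:
        log.append(f"Crawling {path}")
        yield
        # Reached only if the body finished without raising.
        log.append(f"Crawled {path}")


async def demo() -> List[str]:
    limiter = asyncio.Semaphore(1)  # at most one crawl step at a time
    log: List[str] = []
    for path in ["a", "b"]:
        async with crawl_token(limiter, log, path):
            await asyncio.sleep(0)  # stand-in for real crawling work
    return log
```

A download token could reuse the same shape with different messages, which is what makes the pattern attractive for unifying both steps.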
b4d97cd545 Improve output dir and report error handling 2021-05-22 20:54:42 +02:00
a7c025fd86 Implement reusable FileSinkToken for OutputDirectory 2021-05-19 17:16:23 +02:00
4b68fa771f Move logging logic to singleton
- Renamed module and class because "conductor" didn't make a lot of sense
- Used singleton approach (there's only one stdout after all)
- Redesigned progress bars (now with download speed!)
2021-05-18 22:45:19 +02:00
0bae009189 Run formatting tools 2021-05-16 14:32:53 +02:00
9fd356d290 Ensure tmp files are deleted
This doesn't seem to fix the case where an exception bubbles up to the top of
the event loop. It also doesn't seem to fix the case when a KeyboardInterrupt is
thrown, since that never makes its way into the event loop in the first place.

Both of these cases lead to the event loop stopping, which means that the tmp
file cleanup doesn't get executed even though it's inside a "with" or "finally".
2021-05-15 23:00:40 +02:00
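For context on 9fd356d290, the basic "finally" pattern for temporary files looks like the sketch below. The `tmp_file` helper is hypothetical and synchronous. It only cleans up when the exception actually unwinds through the `finally` block, which is exactly what does not happen in the two cases the commit message describes.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def tmp_file(directory: str) -> Iterator[str]:
    """Create a temporary file and remove it when the block exits."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        yield path
    finally:
        # Runs on normal exit and when an exception unwinds through here.
        if os.path.exists(path):
            os.remove(path)
```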
989032fe0c Fix cookies getting deleted 2021-05-15 22:25:48 +02:00
05573ccc53 Add fancy CLI options 2021-05-15 22:22:01 +02:00
0c9167512c Fix output dir
I missed these while renaming the resolve function. Shame on me for not running
mypy earlier.
2021-05-14 21:28:38 +02:00
d565df27b3 Add HttpCrawler 2021-05-13 22:28:14 +02:00
38bb66a776 Update file metadata in more cases
PFERD now not only updates file metadata when a file is successfully added or
changed, but also when a file is downloaded and then detected to be unchanged.

This could occur for example if a remote file's modification time was bumped,
possibly because somebody touched the file without changing it.
2021-05-13 19:40:10 +02:00
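The behaviour described in 38bb66a776 can be illustrated with the sketch below. It assumes "file metadata" means the modification time and that unchanged files are detected by comparing contents; the `store_download` helper and its return values are hypothetical.

```python
import os


def store_download(path: str, content: bytes, remote_mtime: float) -> str:
    """Write `content` to `path` if needed and always sync the mtime."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            status = "unchanged" if f.read() == content else "changed"
    else:
        status = "added"
    if status != "unchanged":
        with open(path, "wb") as f:
            f.write(content)
    # Update the timestamps in every case, including "unchanged", so a
    # bumped remote mtime does not trigger the same download again.
    os.utime(path, (remote_mtime, remote_mtime))
    return status
```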
68781a88ab Fix asynchronous methods not being awaited 2021-05-13 19:39:49 +02:00
910462bb72 Log stuff happening to files 2021-05-13 19:37:27 +02:00
6bd6adb977 Fix tmp file names 2021-05-13 19:36:46 +02:00
60cd9873bc Add local file crawler 2021-05-06 01:02:40 +02:00
bbfdadc463 Implement output directory 2021-05-05 18:08:34 +02:00