Commit Graph

30 Commits

Author SHA1 Message Date
Joscha
eb8b915813 Fix path prefix on windows
Previously, the path prefix was only set if "windows_paths" was true, regardless
of OS. Now the path prefix is always set on windows and never set on other OSes.
2021-05-25 14:23:38 +02:00
Joscha
bce3dc384d Deduplicate path names in crawler
Also rename files so they follow the restrictions for windows file names if
we're on windows.
2021-05-25 12:11:15 +02:00
Joscha
27b5a8e490 Rename log.action to log.status 2021-05-23 22:40:33 +02:00
Joscha
ce1dbda5b4 Overhaul colours
"Crawled" and "Downloaded" are now printed less bright than "Crawling" and
"Downloading" as they're not as important. Explain topics are printed in yellow
to stand out a bit more from the cyan action messages.
2021-05-23 21:33:04 +02:00
Joscha
6ca0ecdf05 Load and store reports 2021-05-23 20:46:29 +02:00
Joscha
5edd868d5b Fix always-smart redownloading the wrong files 2021-05-23 18:49:34 +02:00
Joscha
74c7b39dc8 Clean up files in alphabetical order 2021-05-23 18:39:25 +02:00
Joscha
445dffc987 Reword some explanations 2021-05-23 18:35:32 +02:00
Joscha
c0cecf8363 Log crawl and download actions more extensively 2021-05-23 16:25:44 +02:00
Joscha
b998339002 Fix cleanup logging of paths 2021-05-23 16:25:44 +02:00
Joscha
245c9c3dcc Explain output dir decisions and steps 2021-05-23 16:25:44 +02:00
Joscha
803e5628a2 Clean up logging
Paths are now (hopefully) logged consistently across all crawlers
2021-05-23 11:37:19 +02:00
Joscha
ec3767c545 Create crawler base dir at start of crawl 2021-05-23 10:52:02 +02:00
Joscha
44ecb2fbe7 Fix cleanup deleting crawler's base directory 2021-05-23 10:45:37 +02:00
Joscha
ec95dda18f Unify crawling and downloading steps
Now, the progress bar, limiter etc. for downloading and crawling are all handled
via the reusable CrawlToken and DownloadToken context managers.
2021-05-22 21:36:53 +02:00
Joscha
b4d97cd545 Improve output dir and report error handling 2021-05-22 20:54:42 +02:00
Joscha
a7c025fd86 Implement reusable FileSinkToken for OutputDirectory 2021-05-19 17:16:23 +02:00
Joscha
4b68fa771f Move logging logic to singleton
- Renamed module and class because "conductor" didn't make a lot of sense
- Used singleton approach (there's only one stdout after all)
- Redesigned progress bars (now with download speed!)
2021-05-18 22:45:19 +02:00
Joscha
0bae009189 Run formatting tools 2021-05-16 14:32:53 +02:00
Joscha
9fd356d290 Ensure tmp files are deleted
This doesn't seem to fix the case where an exception bubbles up to the top of
the event loop. It also doesn't seem to fix the case when a KeyboardInterrupt is
thrown, since that never makes its way into the event loop in the first place.

Both of these cases lead to the event loop stopping, which means that the tmp
file cleanup doesn't get executed even though it's inside a "with" or "finally".
2021-05-15 23:00:40 +02:00
Joscha
989032fe0c Fix cookies getting deleted 2021-05-15 22:25:48 +02:00
Joscha
05573ccc53 Add fancy CLI options 2021-05-15 22:22:01 +02:00
Joscha
0c9167512c Fix output dir
I missed these while renaming the resolve function. Shame on me for not running
mypy earlier.
2021-05-14 21:28:38 +02:00
Joscha
d565df27b3 Add HttpCrawler 2021-05-13 22:28:14 +02:00
Joscha
38bb66a776 Update file metadata in more cases
PFERD now not only updates file metadata when a file is successfully added or
changed, but also when a file is downloaded and then detected to be unchanged.

This could occur for example if a remote file's modification time was bumped,
possibly because somebody touched the file without changing it.
2021-05-13 19:40:10 +02:00
Joscha
68781a88ab Fix asynchronous methods being not awaited 2021-05-13 19:39:49 +02:00
Joscha
910462bb72 Log stuff happening to files 2021-05-13 19:37:27 +02:00
Joscha
6bd6adb977 Fix tmp file names 2021-05-13 19:36:46 +02:00
Joscha
60cd9873bc Add local file crawler 2021-05-06 01:02:40 +02:00
Joscha
bbfdadc463 Implement output directory 2021-05-05 18:08:34 +02:00