Commit Graph

41 Commits

Author SHA1 Message Date
Joscha
803e5628a2 Clean up logging
Paths are now (hopefully) logged consistently across all crawlers
2021-05-23 11:37:19 +02:00
Joscha
ec3767c545 Create crawler base dir at start of crawl 2021-05-23 10:52:02 +02:00
I-Al-Istannen
3053278721 Move HTTP crawler to own file 2021-05-22 23:23:21 +02:00
Joscha
62f0f7bfc5 Explain crawling and partially explain downloading 2021-05-22 20:39:57 +00:00
Joscha
e21795ee35 Make file cleanup part of default crawler behaviour 2021-05-22 21:45:51 +02:00
Joscha
ec95dda18f Unify crawling and downloading steps
Now, the progress bar, limiter etc. for downloading and crawling are all handled
via the reusable CrawlToken and DownloadToken context managers.
2021-05-22 21:36:53 +02:00
Joscha
098ac45758 Remove deprecated repeat decorators 2021-05-22 21:13:25 +02:00
Joscha
b4d97cd545 Improve output dir and report error handling 2021-05-22 20:54:42 +02:00
Joscha
98b8ca31fa Add some todos 2021-05-22 14:45:46 +02:00
I-Al-Istannen
4b104b6252 Try out some HTTP authentication handling
This is by no means final yet and will change a bit once the dl and cl
are changed, but it might serve as a first try. It is also wholly
untested.
2021-05-21 12:02:51 +02:00
Joscha
0d10752b5a Configure explain log level via cli and config file 2021-05-19 17:50:10 +02:00
Joscha
92886fb8d8 Implement --version flag 2021-05-19 17:33:36 +02:00
Joscha
b7a999bc2e Clean up crawler exceptions and (a)noncritical 2021-05-19 13:25:57 +02:00
Joscha
4b68fa771f Move logging logic to singleton
- Renamed module and class because "conductor" didn't make a lot of sense
- Used singleton approach (there's only one stdout after all)
- Redesigned progress bars (now with download speed!)
2021-05-18 22:45:19 +02:00
Joscha
0bae009189 Run formatting tools 2021-05-16 14:32:53 +02:00
Joscha
05573ccc53 Add fancy CLI options 2021-05-15 22:22:01 +02:00
Joscha
b70b62cef5 Make crawler sections start with "crawl:"
Also, use only the part of the section name after the "crawl:" as the crawler's
output directory. Now, the implementation matches the documentation again
2021-05-15 17:24:37 +02:00
Joscha
595de88d96 Fix authenticator and crawler names
Now, the "auth:" and "crawl:" parts are considered part of the name. This fixes
crawlers not being able to find their authenticators.
2021-05-15 15:25:05 +02:00
Joscha
b0f731bf84 Make crawlers use transformers 2021-05-15 15:25:05 +02:00
Joscha
acd674f0a0 Change limiter logic
Now download tasks are a subset of all tasks.
2021-05-15 15:25:05 +02:00
Joscha
ed2e19a150 Add reasons for invalid values 2021-05-15 15:25:05 +02:00
Joscha
296a169dd3 Make limiter logic more complex
The limiter can now distinguish between crawl and download actions and has a
fancy slot system and delay logic.
2021-05-15 15:25:05 +02:00
Joscha
6e5fdf4e9e Set user agent to "pferd/<version>" 2021-05-14 21:27:44 +02:00
Joscha
d565df27b3 Add HttpCrawler 2021-05-13 22:28:14 +02:00
Joscha
68781a88ab Fix asynchronous methods not being awaited 2021-05-13 19:39:49 +02:00
Joscha
0acdee15a0 Let crawlers obtain authenticators 2021-05-13 18:57:20 +02:00
Joscha
d5f29f01c5 Use global conductor instance
The switch from crawler-local conductors to a single pferd-global conductor was
made to prepare for auth section credential providers.
2021-05-11 00:05:04 +02:00
Joscha
cec0a8e1fc Fix mypy errors 2021-05-09 01:45:01 +02:00
Joscha
60cd9873bc Add local file crawler 2021-05-06 01:02:40 +02:00
Joscha
273d56c39a Properly load crawler config 2021-05-05 23:45:10 +02:00
Joscha
5497dd2827 Add @noncritical and @repeat decorators 2021-05-05 23:36:54 +02:00
Joscha
bbfdadc463 Implement output directory 2021-05-05 18:08:34 +02:00
Joscha
91c33596da Load crawlers from config file 2021-04-30 16:22:14 +02:00
Joscha
f776186480 Use PurePath instead of Path
Path should only be used when we need to access the file system. For all other
purposes (mainly crawling), we use PurePath instead since the paths don't
correspond to paths in the local file system.
2021-04-29 20:20:25 +02:00
Joscha
502654d853 Fix mypy errors 2021-04-29 15:47:52 +02:00
Joscha
d2103d7c44 Document crawler 2021-04-29 15:43:20 +02:00
Joscha
d96a361325 Test and fix exclusive output 2021-04-29 15:27:16 +02:00
Joscha
2e85d26b6b Use conductor via context manager 2021-04-29 14:23:28 +02:00
Joscha
6431a3fb3d Fix some mypy errors 2021-04-29 14:23:09 +02:00
Joscha
ac3bfd7388 Make progress bars easier to use
The crawler now supports two types of progress bars
2021-04-29 13:53:16 +02:00
Joscha
bbc792f9fb Implement Crawler and DummyCrawler 2021-04-29 13:44:29 +02:00