Joscha
29d5a40c57
Replace asyncio.gather with custom Crawler function
2021-05-23 17:25:16 +02:00
Joscha
c0cecf8363
Log crawl and download actions more extensively
2021-05-23 16:25:44 +02:00
Joscha
b44b49476d
Fix noncritical and anoncritical decorators
...
I must have forgotten to update the anoncritical decorator when I last changed
the noncritical decorator. Also, any exception should mark the crawler as not
error_free, not just CrawlErrors.
2021-05-23 13:24:53 +02:00
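The behaviour described in the commit above can be illustrated with a minimal sketch. It assumes a crawler object with a boolean `error_free` attribute and a pair of decorators, one for synchronous and one for asynchronous methods. `DemoCrawler` and its methods are hypothetical, and the decorator bodies are an illustration, not the project's actual code.

```python
import asyncio
import functools


def noncritical(method):
    """Swallow any exception from a sync method and mark the crawler as failed."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        try:
            return method(self, *args, **kwargs)
        except Exception:
            # Any exception counts, not only a dedicated crawl error type.
            self.error_free = False
            return None
    return wrapper


def anoncritical(method):
    """Async counterpart of noncritical; must mirror its behaviour."""
    @functools.wraps(method)
    async def wrapper(self, *args, **kwargs):
        try:
            return await method(self, *args, **kwargs)
        except Exception:
            self.error_free = False
            return None
    return wrapper


class DemoCrawler:
    """Hypothetical crawler used only to exercise the decorators."""

    def __init__(self):
        self.error_free = True

    @noncritical
    def safe_step(self):
        return "ok"

    @noncritical
    def risky_step(self):
        raise ValueError("sync failure")

    @anoncritical
    async def risky_async_step(self):
        raise RuntimeError("async failure")


if __name__ == "__main__":
    crawler = DemoCrawler()
    crawler.risky_step()
    asyncio.run(crawler.risky_async_step())
    print(crawler.error_free)
```

Keeping both wrappers structurally identical makes it harder for one of them to drift out of sync with the other.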
Joscha
803e5628a2
Clean up logging
...
Paths are now (hopefully) logged consistently across all crawlers
2021-05-23 11:37:19 +02:00
Joscha
ec3767c545
Create crawler base dir at start of crawl
2021-05-23 10:52:02 +02:00
I-Al-Istannen
3053278721
Move HTTP crawler to own file
2021-05-22 23:23:21 +02:00
Joscha
62f0f7bfc5
Explain crawling and partially explain downloading
2021-05-22 20:39:57 +00:00
Joscha
e21795ee35
Make file cleanup part of default crawler behaviour
2021-05-22 21:45:51 +02:00
Joscha
ec95dda18f
Unify crawling and downloading steps
...
Now, the progress bar, limiter etc. for downloading and crawling are all handled
via the reusable CrawlToken and DownloadToken context managers.
2021-05-22 21:36:53 +02:00
Joscha
098ac45758
Remove deprecated repeat decorators
2021-05-22 21:13:25 +02:00
Joscha
b4d97cd545
Improve output dir and report error handling
2021-05-22 20:54:42 +02:00
Joscha
98b8ca31fa
Add some todos
2021-05-22 14:45:46 +02:00
I-Al-Istannen
4b104b6252
Try out some HTTP authentication handling
...
This is by no means final yet and will change a bit once the dl and cl
are changed, but it might serve as a first try. It is also wholly
untested.
2021-05-21 12:02:51 +02:00
Joscha
0d10752b5a
Configure explain log level via cli and config file
2021-05-19 17:50:10 +02:00
Joscha
92886fb8d8
Implement --version flag
2021-05-19 17:33:36 +02:00
Joscha
b7a999bc2e
Clean up crawler exceptions and (a)noncritical
2021-05-19 13:25:57 +02:00
Joscha
4b68fa771f
Move logging logic to singleton
...
- Renamed module and class because "conductor" didn't make a lot of sense
- Used singleton approach (there's only one stdout after all)
- Redesigned progress bars (now with download speed!)
2021-05-18 22:45:19 +02:00
Joscha
0bae009189
Run formatting tools
2021-05-16 14:32:53 +02:00
Joscha
05573ccc53
Add fancy CLI options
2021-05-15 22:22:01 +02:00
Joscha
b70b62cef5
Make crawler sections start with "crawl:"
...
Also, use only the part of the section name after the "crawl:" as the crawler's
output directory. Now, the implementation matches the documentation again.
2021-05-15 17:24:37 +02:00
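As a rough sketch of the naming rule in the commit above: the full section name (including the "crawl:" prefix) identifies the crawler, while only the part after the prefix is used as its output directory. The helper `crawl_sections` and the example section names are hypothetical, and the config is parsed with the standard library's `configparser` purely for illustration.

```python
import configparser
from pathlib import PurePath

CRAWL_PREFIX = "crawl:"


def crawl_sections(config):
    """Map each crawler section name to the output directory derived from it."""
    result = {}
    for name in config.sections():
        if name.startswith(CRAWL_PREFIX):
            # The crawler keeps its full name; the directory drops the prefix.
            result[name] = PurePath(name[len(CRAWL_PREFIX):])
    return result


if __name__ == "__main__":
    config = configparser.ConfigParser()
    config.read_string(
        "[auth:main]\n"
        "type = simple\n"
        "[crawl:lectures]\n"
        "type = local\n"
    )
    for name, directory in crawl_sections(config).items():
        print(name, directory)
```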
Joscha
595de88d96
Fix authenticator and crawler names
...
Now, the "auth:" and "crawl:" parts are considered part of the name. This fixes
crawlers not being able to find their authenticators.
2021-05-15 15:25:05 +02:00
Joscha
b0f731bf84
Make crawlers use transformers
2021-05-15 15:25:05 +02:00
Joscha
acd674f0a0
Change limiter logic
...
Now download tasks are a subset of all tasks.
2021-05-15 15:25:05 +02:00
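One way to read "download tasks are a subset of all tasks" is that every download first occupies a general task slot and then an additional download slot. The sketch below models that with two `asyncio.Semaphore` objects. The `Limiter` class, its method names, and the limits are assumptions made for illustration; the delay logic mentioned in the earlier limiter commit is left out.

```python
import asyncio
from contextlib import asynccontextmanager


class Limiter:
    """Hypothetical limiter: downloads count against both limits."""

    def __init__(self, task_limit, download_limit):
        self._tasks = asyncio.Semaphore(task_limit)
        self._downloads = asyncio.Semaphore(download_limit)

    @asynccontextmanager
    async def limit_crawl(self):
        # A crawl action only needs a general task slot.
        async with self._tasks:
            yield

    @asynccontextmanager
    async def limit_download(self):
        # A download is also a task, so it takes a task slot first
        # and a download slot second.
        async with self._tasks:
            async with self._downloads:
                yield


async def measure_peak_downloads(task_limit, download_limit, count):
    """Run `count` fake downloads and return the peak number running at once."""
    limiter = Limiter(task_limit, download_limit)
    active = 0
    peak = 0

    async def download():
        nonlocal active, peak
        async with limiter.limit_download():
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)  # yield so other tasks get a chance to run
            active -= 1

    await asyncio.gather(*(download() for _ in range(count)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(measure_peak_downloads(3, 1, 5)))
```

The limiter is created inside the running coroutine so that the semaphores are bound to the active event loop on older Python versions as well.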
Joscha
ed2e19a150
Add reasons for invalid values
2021-05-15 15:25:05 +02:00
Joscha
296a169dd3
Make limiter logic more complex
...
The limiter can now distinguish between crawl and download actions and has a
fancy slot system and delay logic.
2021-05-15 15:25:05 +02:00
Joscha
6e5fdf4e9e
Set user agent to "pferd/<version>"
2021-05-14 21:27:44 +02:00
Joscha
d565df27b3
Add HttpCrawler
2021-05-13 22:28:14 +02:00
Joscha
68781a88ab
Fix asynchronous methods not being awaited
2021-05-13 19:39:49 +02:00
Joscha
0acdee15a0
Let crawlers obtain authenticators
2021-05-13 18:57:20 +02:00
Joscha
d5f29f01c5
Use global conductor instance
...
The switch from crawler-local conductors to a single pferd-global conductor was
made to prepare for auth section credential providers.
2021-05-11 00:05:04 +02:00
Joscha
cec0a8e1fc
Fix mypy errors
2021-05-09 01:45:01 +02:00
Joscha
60cd9873bc
Add local file crawler
2021-05-06 01:02:40 +02:00
Joscha
273d56c39a
Properly load crawler config
2021-05-05 23:45:10 +02:00
Joscha
5497dd2827
Add @noncritical and @repeat decorators
2021-05-05 23:36:54 +02:00
Joscha
bbfdadc463
Implement output directory
2021-05-05 18:08:34 +02:00
Joscha
91c33596da
Load crawlers from config file
2021-04-30 16:22:14 +02:00
Joscha
f776186480
Use PurePath instead of Path
...
Path should only be used when we need to access the file system. For all other
purposes (mainly crawling), we use PurePath instead since the paths don't
correspond to paths in the local file system.
2021-04-29 20:20:25 +02:00
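The distinction drawn in the commit above can be shown with a short sketch: crawled paths stay `PurePath` values, which support path arithmetic but no file system access, and are turned into a concrete `Path` only at the point where something is written to disk. The function `resolve_in_output_dir` and its safety check are hypothetical and not taken from the project.

```python
from pathlib import Path, PurePath


def resolve_in_output_dir(output_dir: Path, crawled: PurePath) -> Path:
    """Turn a crawled (purely logical) path into a real path below output_dir."""
    # PurePath offers parts, parents and joins, but never touches the disk.
    if crawled.is_absolute() or ".." in crawled.parts:
        raise ValueError(f"refusing to leave the output directory: {crawled}")
    # Joining with a concrete Path yields a concrete Path again.
    return output_dir / crawled


if __name__ == "__main__":
    target = resolve_in_output_dir(Path("out"), PurePath("lecture/slides.pdf"))
    print(target)
```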
Joscha
502654d853
Fix mypy errors
2021-04-29 15:47:52 +02:00
Joscha
d2103d7c44
Document crawler
2021-04-29 15:43:20 +02:00
Joscha
d96a361325
Test and fix exclusive output
2021-04-29 15:27:16 +02:00
Joscha
2e85d26b6b
Use conductor via context manager
2021-04-29 14:23:28 +02:00
Joscha
6431a3fb3d
Fix some mypy errors
2021-04-29 14:23:09 +02:00
Joscha
ac3bfd7388
Make progress bars easier to use
...
The crawler now supports two types of progress bars.
2021-04-29 13:53:16 +02:00
Joscha
bbc792f9fb
Implement Crawler and DummyCrawler
2021-04-29 13:44:29 +02:00