I-Al-Istannen
3d4b997d4a
Retry crawl_url and work around Python's closure handling
...
Closures capture the scope and not the variables. Therefore, any
type-narrowing performed by mypy on captured variables is lost inside
the closure.
2021-05-23 12:28:15 +02:00
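The closure note in 3d4b997d4a can be illustrated with a minimal sketch. It is not the code from that commit: the names `late_binding_demo`, `make_fetcher`, and `narrowed_url` are hypothetical and exist only for illustration. The first function shows that a closure reads the enclosing scope at call time. The second shows one possible workaround: binding the narrowed value to a fresh, non-optional name. Whether mypy reports an error without the workaround may depend on the mypy version.

```python
from typing import Callable, Optional


def late_binding_demo() -> str:
    value = "first"

    def read() -> str:
        # The closure looks up `value` in the enclosing scope when called,
        # not when it was defined.
        return value

    value = "second"
    return read()


def make_fetcher(url: Optional[str]) -> Callable[[], str]:
    if url is None:
        raise ValueError("url must not be None")

    # At this point a type checker has narrowed `url` to `str`. Because the
    # closure below only sees the scope, that narrowing may not carry over,
    # so the value is bound to a new name declared as `str`.
    narrowed_url: str = url

    def fetch() -> str:
        return "fetching " + narrowed_url

    return fetch
```

Calling `late_binding_demo()` returns `"second"`, which is why a checker cannot safely assume that a narrowed type still holds inside a closure.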
I-Al-Istannen
33a81a5f5c
Document authentication in HTTP crawler and rename prepare_request
2021-05-23 11:55:34 +02:00
Joscha
803e5628a2
Clean up logging
...
Paths are now (hopefully) logged consistently across all crawlers
2021-05-23 11:37:19 +02:00
I-Al-Istannen
53e031d9f6
Reuse dl/cl for I/O retries in ILIAS crawler
2021-05-23 00:28:27 +02:00
I-Al-Istannen
3053278721
Move HTTP crawler to own file
2021-05-22 23:23:21 +02:00
I-Al-Istannen
4d07de0d71
Adjust forum log message in ilias crawler
2021-05-22 23:20:21 +02:00
I-Al-Istannen
953a1bba93
Adjust to new crawl / download names
2021-05-22 23:18:05 +02:00
I-Al-Istannen
4b104b6252
Try out some HTTP authentication handling
...
This is by no means final yet and will change a bit once the dl and cl
are changed, but it might serve as a first try. It is also wholly
untested.
2021-05-21 12:02:51 +02:00
I-Al-Istannen
83d12fcf2d
Add some explains to ilias crawler and use crawler exceptions
2021-05-20 14:58:54 +02:00
I-Al-Istannen
e4f9560655
Only retry on aiohttp errors in ILIAS crawler
...
This patch removes quite a few retries and now only retries the ilias
element method. Every other HTTP-interacting method (except for the root
requests) is called from there and should be covered.
In the future we also want to retry the root a few times, but that
will be done after the download sink API is adjusted.
2021-05-19 22:01:09 +02:00
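The retry policy described in e4f9560655 (retry one central method, and only on HTTP client errors) could look roughly like the sketch below. This is an assumption-laden illustration, not the crawler's implementation: `TransientHttpError` is a hypothetical stand-in for the aiohttp error type so that the example runs with the standard library alone, and `retry_on_http_error` and its `attempts` parameter are likewise invented here.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


class TransientHttpError(Exception):
    """Hypothetical stand-in for an HTTP client error such as aiohttp's."""


async def retry_on_http_error(
    action: Callable[[], Awaitable[T]],
    attempts: int = 3,
) -> T:
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error: Optional[TransientHttpError] = None
    for _ in range(attempts):
        try:
            return await action()
        except TransientHttpError as error:
            # Only the HTTP error type is retried; any other exception
            # propagates to the caller immediately.
            last_error = error

    # All attempts failed with an HTTP error, so re-raise the last one.
    assert last_error is not None
    raise last_error
```

Wrapping only the central method keeps the retry logic in one place, on the assumption stated in the commit message that the other HTTP-interacting methods are called from there.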
I-Al-Istannen
8cfa818f04
Only call should_crawl once
2021-05-19 21:57:55 +02:00
I-Al-Istannen
81301f3a76
Rename the ilias crawler to ilias web crawler
2021-05-19 21:41:17 +02:00
I-Al-Istannen
2976b4d352
Move ILIAS file templates to own file
2021-05-19 21:37:10 +02:00
I-Al-Istannen
9f03702e69
Split up ilias crawler in multiple files
...
The ilias crawler contained both a crawler and an HTML parser. They are
now split in two.
2021-05-19 21:34:36 +02:00