Commit Graph

14 Commits

Author SHA1 Message Date
I-Al-Istannen
3d4b997d4a Retry crawl_url and work around Python's closure handling
Closures capture the scope and not the variables. Therefore, any
type-narrowing performed by mypy on captured variables is lost inside
the closure.
2021-05-23 12:28:15 +02:00
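Note on 3d4b997d4a: the closure behaviour described in the commit body can be reproduced with a minimal sketch. It is not code from this repository, and all names in it (`late_bound`, `early_bound`, `build_greeter`, `narrowed`) are hypothetical. It assumes the workaround in question is to copy the narrowed value into a fresh local before defining the nested function, which is one common way to handle this; the commit does not say which approach it takes.

```python
from typing import Callable, List, Optional


def late_bound() -> List[Callable[[], int]]:
    # Each lambda looks up `i` in the enclosing scope when it is called,
    # so all of them see the value `i` has after the loop has finished.
    return [lambda: i for i in range(3)]


def early_bound() -> List[Callable[[], int]]:
    # A default argument is evaluated at definition time, which freezes
    # the current value of `i` for each lambda.
    return [lambda i=i: i for i in range(3)]


def build_greeter(name: Optional[str]) -> Callable[[], str]:
    if name is None:
        name = "anonymous"

    # Here a type checker knows `name` is a `str`. Inside `greet` it may
    # fall back to `Optional[str]`, because the closure reads the variable
    # at call time and the variable could have been reassigned by then.
    # Binding the value to a new, never-reassigned local avoids that.
    narrowed: str = name

    def greet() -> str:
        return "Hello, " + narrowed

    return greet


late_values = [f() for f in late_bound()]
early_values = [f() for f in early_bound()]
greeting = build_greeter(None)()
```

Whether a given mypy version reports an error without the extra local depends on the version and on whether the variable is reassigned later, so treat the comment about narrowing as an assumption to verify against the checker in use.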
I-Al-Istannen
33a81a5f5c Document authentication in HTTP crawler and rename prepare_request 2021-05-23 11:55:34 +02:00
Joscha
803e5628a2 Clean up logging
Paths are now (hopefully) logged consistently across all crawlers
2021-05-23 11:37:19 +02:00
I-Al-Istannen
53e031d9f6 Reuse dl/cl for I/O retries in ILIAS crawler 2021-05-23 00:28:27 +02:00
I-Al-Istannen
3053278721 Move HTTP crawler to own file 2021-05-22 23:23:21 +02:00
I-Al-Istannen
4d07de0d71 Adjust forum log message in ilias crawler 2021-05-22 23:20:21 +02:00
I-Al-Istannen
953a1bba93 Adjust to new crawl / download names 2021-05-22 23:18:05 +02:00
I-Al-Istannen
4b104b6252 Try out some HTTP authentication handling
This is by no means final yet and will change a bit once the dl and cl
are changed, but it might serve as a first try. It is also wholly
untested.
2021-05-21 12:02:51 +02:00
I-Al-Istannen
83d12fcf2d Add some explains to ilias crawler and use crawler exceptions 2021-05-20 14:58:54 +02:00
I-Al-Istannen
e4f9560655 Only retry on aiohttp errors in ILIAS crawler
This patch removes quite a few retries and now only retries the ilias
element method. Every other HTTP-interacting method (except for the root
requests) is called from there and should be covered.

In the future we also want to retry the root a few times, but that
will be done after the download sink API is adjusted.
2021-05-19 22:01:09 +02:00
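Note on e4f9560655: retrying a single entry point only on HTTP client errors could look roughly like the sketch below. It is an illustration under stated assumptions, not the crawler's actual code: `TransientHttpError` is a hypothetical stand-in for the aiohttp error type the commit refers to (presumably something like `aiohttp.ClientError`), and `retry_on_http_error`, `flaky_fetch`, and the attempt count are made up for the example.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, TypeVar

T = TypeVar("T")


class TransientHttpError(Exception):
    """Hypothetical stand-in for the HTTP client error that should be retried."""


async def retry_on_http_error(
    action: Callable[[], Awaitable[T]],
    attempts: int = 3,
) -> T:
    # Retry only on the HTTP error type. Any other exception propagates
    # immediately, so unrelated bugs are not hidden behind retries.
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error: Optional[TransientHttpError] = None
    for _ in range(attempts):
        try:
            return await action()
        except TransientHttpError as error:
            last_error = error

    assert last_error is not None
    raise last_error


# Usage: a fake request that fails twice and then succeeds.
seen_calls: List[int] = []


async def flaky_fetch() -> str:
    seen_calls.append(len(seen_calls) + 1)
    if len(seen_calls) < 3:
        raise TransientHttpError("connection reset")
    return "page"


result = asyncio.run(retry_on_http_error(flaky_fetch))
```

Wrapping only the outermost method, as the commit describes, means every HTTP-interacting helper called from it is covered by the same retry loop without each one needing its own.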
I-Al-Istannen
8cfa818f04 Only call should_crawl once 2021-05-19 21:57:55 +02:00
I-Al-Istannen
81301f3a76 Rename the ilias crawler to ilias web crawler 2021-05-19 21:41:17 +02:00
I-Al-Istannen
2976b4d352 Move ILIAS file templates to own file 2021-05-19 21:37:10 +02:00
I-Al-Istannen
9f03702e69 Split up ilias crawler in multiple files
The ilias crawler contained a crawler and an HTML parser, now they are
split in two.
2021-05-19 21:34:36 +02:00