Commit Graph

57 Commits

Author SHA1 Message Date
I-Al-Istannen
05ad06fbc1 Only enclose get_page in iorepeat in ILIAS crawler
We previously also gathered in there, which could lead to some more
surprises when the method was retried.
2021-05-23 18:14:51 +02:00
Joscha
29d5a40c57 Replace asyncio.gather with custom Crawler function 2021-05-23 17:25:16 +02:00
I-Al-Istannen
e1d18708b3 Rename "no_videos" to videos 2021-05-23 13:30:42 +02:00
Joscha
7e0bb06259 Clean up TODOs 2021-05-23 12:47:30 +02:00
I-Al-Istannen
ecdedfa1cf Add no-videos flag to ILIAS crawler 2021-05-23 12:37:01 +02:00
I-Al-Istannen
3d4b997d4a Retry crawl_url and work around Python's closure handling
Closures capture the scope and not the variables. Therefore, any
type-narrowing performed by mypy on captured variables is lost inside
the closure.
2021-05-23 12:28:15 +02:00
I-Al-Istannen
33a81a5f5c Document authentication in HTTP crawler and rename prepare_request 2021-05-23 11:55:34 +02:00
Joscha
803e5628a2 Clean up logging
Paths are now (hopefully) logged consistently across all crawlers
2021-05-23 11:37:19 +02:00
I-Al-Istannen
53e031d9f6 Reuse dl/cl for I/O retries in ILIAS crawler 2021-05-23 00:28:27 +02:00
I-Al-Istannen
3053278721 Move HTTP crawler to own file 2021-05-22 23:23:21 +02:00
I-Al-Istannen
4d07de0d71 Adjust forum log message in ilias crawler 2021-05-22 23:20:21 +02:00
I-Al-Istannen
953a1bba93 Adjust to new crawl / download names 2021-05-22 23:18:05 +02:00
Joscha
ae3d80664c Update local crawler to new crawler structure 2021-05-22 21:46:36 +02:00
I-Al-Istannen
4b104b6252 Try out some HTTP authentication handling
This is by no means final yet and will change a bit once the dl and cl
are changed, but it might serve as a first try. It is also wholly
untested.
2021-05-21 12:02:51 +02:00
I-Al-Istannen
83d12fcf2d Add some explains to ilias crawler and use crawler exceptions 2021-05-20 14:58:54 +02:00
I-Al-Istannen
e4f9560655 Only retry on aiohttp errors in ILIAS crawler
This patch removes quite a few retries and now only retries the ilias
element method. Every other HTTP-interacting method (except for the root
requests) is called from there and should be covered.

In the future we also want to retry the root a few times, but that
will be done after the download sink API is adjusted.
2021-05-19 22:01:09 +02:00
I-Al-Istannen
8cfa818f04 Only call should_crawl once 2021-05-19 21:57:55 +02:00
I-Al-Istannen
81301f3a76 Rename the ilias crawler to ilias web crawler 2021-05-19 21:41:17 +02:00
I-Al-Istannen
2976b4d352 Move ILIAS file templates to own file 2021-05-19 21:37:10 +02:00
I-Al-Istannen
9f03702e69 Split up ilias crawler in multiple files
The ilias crawler contained a crawler and an HTML parser, now they are
split in two.
2021-05-19 21:34:36 +02:00
Joscha
0d10752b5a Configure explain log level via cli and config file 2021-05-19 17:50:10 +02:00
Joscha
5916626399 Make noqua comment more specific 2021-05-19 17:16:59 +02:00
Joscha
3851065500 Fix local crawler's download bars
Display the pure path instead of the local path.
2021-05-18 23:23:40 +02:00
Joscha
4b68fa771f Move logging logic to singleton
- Renamed module and class because "conductor" didn't make a lot of sense
- Used singleton approach (there's only one stdout after all)
- Redesigned progress bars (now with download speed!)
2021-05-18 22:45:19 +02:00
I-Al-Istannen
1525aa15a6 Fix link template error and use indeterminate progress bar 2021-05-18 22:40:28 +02:00
I-Al-Istannen
db1219d4a9 Create a link file in ILIAS crawler
This allows us to crawl links and represent them in the file system.
Users can choose between an ILIAS-imitation (that optionally
auto-redirects) and a plain text variant.
2021-05-17 21:44:54 +02:00
I-Al-Istannen
b8efcc2ca5 Respect filters in ILIAS crawler 2021-05-17 21:30:26 +02:00
Joscha
0bae009189 Run formatting tools 2021-05-16 14:32:53 +02:00
I-Al-Istannen
8b76ebb3ef Rename IliasCrawler to KitIliasCrawler 2021-05-16 13:28:06 +02:00
I-Al-Istannen
2b6235dc78 Fix pylint warnings (and 2 found bugs) in ILIAS crawler 2021-05-16 13:17:12 +02:00
I-Al-Istannen
1c226c31aa Add some repeat annotations to the ILIAS crawler 2021-05-16 13:01:56 +02:00
I-Al-Istannen
9ec0d3e16a Implement date-demangling in ILIAS crawler 2021-05-16 13:01:56 +02:00
I-Al-Istannen
cf6903d109 Retry crawling on I/O failure 2021-05-16 13:01:56 +02:00
I-Al-Istannen
c454fabc9d Add support for exercises in ILIAS crawler 2021-05-15 21:40:17 +02:00
I-Al-Istannen
7d323ec62b Implement video downloads in ilias crawler 2021-05-15 21:32:32 +02:00
I-Al-Istannen
c7494e32ce Start implementing crawling in ILIAS crawler
The ilias crawler can now crawl quite a few filetypes, splits off
folders and crawls them concurrently.
2021-05-15 20:42:18 +02:00
I-Al-Istannen
1123c8884d Implement an IliasPage
This allows PFERD to semantically understand ILIAS HTML and is the
foundation for the ILIAS crawler. This patch extends the ILIAS crawler
to crawl the personal desktop and print the elements on it.
2021-05-15 18:59:23 +02:00
Joscha
8c32da7f19 Let authenticators provide username and password separately 2021-05-15 18:27:03 +02:00
Joscha
d63494908d Properly invalidate exceptions
The simple authenticator now properly invalidates its credentials. Also, the
invalidation functions have been given better names and documentation.
2021-05-15 17:37:05 +02:00
Joscha
868f486922 Rename local crawler path to target 2021-05-15 17:12:25 +02:00
I-Al-Istannen
b2a2b5999b Implement ILIAS auth and crawl home page
This commit introduces the necessary machinery to authenticate with
ILIAS and crawl the home page.

It can't do much yet and just silently fetches the homepage.
2021-05-15 15:25:05 +02:00
Joscha
b0f731bf84 Make crawlers use transformers 2021-05-15 15:25:05 +02:00
Joscha
302b8c0c34 Fix errors loading local crawler config
Apparently getint and getfloat may return a None even though this is not
mentioned in their type annotations.
2021-05-15 15:25:05 +02:00
Joscha
ed2e19a150 Add reasons for invalid values 2021-05-15 15:25:05 +02:00
Joscha
1591cb9197 Add options to slow down local crawler
These options are meant to make the local crawler behave more like a
network-based crawler for purposes of testing and debugging other parts of the
code base.
2021-05-15 15:25:01 +02:00
Joscha
94d6a01cca Use file mtime in local crawler 2021-05-13 19:42:40 +02:00
Joscha
0acdee15a0 Let crawlers obtain authenticators 2021-05-13 18:57:20 +02:00
Joscha
c3ce6bb31c Fix crawler cleanup not being awaited 2021-05-11 00:28:45 +02:00
Joscha
d5f29f01c5 Use global conductor instance
The switch from crawler-local conductors to a single pferd-global conductor was
made to prepare for auth section credential providers.
2021-05-11 00:05:04 +02:00
Joscha
595ba8b7ab Remove dummy crawler 2021-05-10 23:47:46 +02:00