259 Commits

Author SHA1 Message Date
I-Al-Istannen
1c226c31aa Add some repeat annotations to the ILIAS crawler 2021-05-16 13:01:56 +02:00
I-Al-Istannen
9ec0d3e16a Implement date-demangling in ILIAS crawler 2021-05-16 13:01:56 +02:00
I-Al-Istannen
cf6903d109 Retry crawling on I/O failure 2021-05-16 13:01:56 +02:00
Joscha
9fd356d290 Ensure tmp files are deleted
This doesn't seem to fix the case where an exception bubbles up to the top of
the event loop. It also doesn't seem to fix the case when a KeyboardInterrupt is
thrown, since that never makes its way into the event loop in the first place.

Both of these cases lead to the event loop stopping, which means that the tmp
file cleanup doesn't get executed even though it's inside a "with" or "finally".
2021-05-15 23:00:40 +02:00
Joscha
989032fe0c Fix cookies getting deleted 2021-05-15 22:25:48 +02:00
Joscha
05573ccc53 Add fancy CLI options 2021-05-15 22:22:01 +02:00
I-Al-Istannen
c454fabc9d Add support for exercises in ILIAS crawler 2021-05-15 21:40:17 +02:00
I-Al-Istannen
7d323ec62b Implement video downloads in ilias crawler 2021-05-15 21:32:32 +02:00
I-Al-Istannen
c7494e32ce Start implementing crawling in ILIAS crawler
The ilias crawler can now crawl quite a few filetypes, splits off
folders and crawls them concurrently.
2021-05-15 20:42:18 +02:00
I-Al-Istannen
1123c8884d Implement an IliasPage
This allows PFERD to semantically understand ILIAS HTML and is the
foundation for the ILIAS crawler. This patch extends the ILIAS crawler
to crawl the personal desktop and print the elements on it.
2021-05-15 18:59:23 +02:00
Joscha
e1104f888d Add tfa authenticator 2021-05-15 18:27:16 +02:00
Joscha
8c32da7f19 Let authenticators provide username and password separately 2021-05-15 18:27:03 +02:00
Joscha
d63494908d Properly invalidate exceptions
The simple authenticator now properly invalidates its credentials. Also, the
invalidation functions have been given better names and documentation.
2021-05-15 17:37:05 +02:00
Joscha
b70b62cef5 Make crawler sections start with "crawl:"
Also, use only the part of the section name after the "crawl:" as the crawler's
output directory. Now, the implementation matches the documentation again.
2021-05-15 17:24:37 +02:00
Joscha
868f486922 Rename local crawler path to target 2021-05-15 17:12:25 +02:00
I-Al-Istannen
b2a2b5999b Implement ILIAS auth and crawl home page
This commit introduces the necessary machinery to authenticate with
ILIAS and crawl the home page.

It can't do much yet and just silently fetches the homepage.
2021-05-15 15:25:05 +02:00
Joscha
595de88d96 Fix authenticator and crawler names
Now, the "auth:" and "crawl:" parts are considered part of the name. This fixes
crawlers not being able to find their authenticators.
2021-05-15 15:25:05 +02:00
Joscha
a6fdf05ee9 Allow variable whitespace in arrow rules 2021-05-15 15:25:05 +02:00
Joscha
f897d7c2e1 Add name variants for all arrows 2021-05-15 15:25:05 +02:00
Joscha
b0f731bf84 Make crawlers use transformers 2021-05-15 15:25:05 +02:00
Joscha
302b8c0c34 Fix errors loading local crawler config
Apparently getint and getfloat may return a None even though this is not
mentioned in their type annotations.
2021-05-15 15:25:05 +02:00
Joscha
acd674f0a0 Change limiter logic
Now download tasks are a subset of all tasks.
2021-05-15 15:25:05 +02:00
Joscha
ed2e19a150 Add reasons for invalid values 2021-05-15 15:25:05 +02:00
Joscha
296a169dd3 Make limiter logic more complex
The limiter can now distinguish between crawl and download actions and has a
fancy slot system and delay logic.
2021-05-15 15:25:05 +02:00
Joscha
1591cb9197 Add options to slow down local crawler
These options are meant to make the local crawler behave more like a
network-based crawler for purposes of testing and debugging other parts of the
code base.
2021-05-15 15:25:01 +02:00
Joscha
0c9167512c Fix output dir
I missed these while renaming the resolve function. Shame on me for not running
mypy earlier.
2021-05-14 21:28:38 +02:00
Joscha
a673ab0fae Delete old files
I should've done this earlier.
2021-05-14 21:27:44 +02:00
Joscha
6e5fdf4e9e Set user agent to "pferd/<version>" 2021-05-14 21:27:44 +02:00
Joscha
93a5a94dab Single-source version number 2021-05-14 21:27:44 +02:00
Joscha
d565df27b3 Add HttpCrawler 2021-05-13 22:28:14 +02:00
Joscha
e3ee4e515d Disable highlighting of primitives
This commit prevents rich from highlighting python-looking syntax like numbers,
arrays, 'None' etc.
2021-05-13 19:47:44 +02:00
Joscha
94d6a01cca Use file mtime in local crawler 2021-05-13 19:42:40 +02:00
Joscha
38bb66a776 Update file metadata in more cases
PFERD now not only updates file metadata when a file is successfully added or
changed, but also when a file is downloaded and then detected to be unchanged.

This could occur for example if a remote file's modification time was bumped,
possibly because somebody touched the file without changing it.
2021-05-13 19:40:10 +02:00
Joscha
68781a88ab Fix asynchronous methods not being awaited 2021-05-13 19:39:49 +02:00
Joscha
910462bb72 Log stuff happening to files 2021-05-13 19:37:27 +02:00
Joscha
6bd6adb977 Fix tmp file names 2021-05-13 19:36:46 +02:00
Joscha
0acdee15a0 Let crawlers obtain authenticators 2021-05-13 18:57:20 +02:00
Joscha
c3ce6bb31c Fix crawler cleanup not being awaited 2021-05-11 00:28:45 +02:00
Joscha
0459ed093e Add simple authenticator
... including some required authenticator infrastructure
2021-05-11 00:28:03 +02:00
Joscha
d5f29f01c5 Use global conductor instance
The switch from crawler-local conductors to a single pferd-global conductor was
made to prepare for auth section credential providers.
2021-05-11 00:05:04 +02:00
Joscha
595ba8b7ab Remove dummy crawler 2021-05-10 23:47:46 +02:00
Joscha
cec0a8e1fc Fix mypy errors 2021-05-09 01:45:01 +02:00
Joscha
f9b2fd60e2 Document local crawler and auth 2021-05-09 01:33:47 +02:00
Joscha
60cd9873bc Add local file crawler 2021-05-06 01:02:40 +02:00
Joscha
273d56c39a Properly load crawler config 2021-05-05 23:45:10 +02:00
Joscha
5497dd2827 Add @noncritical and @repeat decorators 2021-05-05 23:36:54 +02:00
Joscha
bbfdadc463 Implement output directory 2021-05-05 18:08:34 +02:00
Joscha
07e831218e Add sync report 2021-05-02 00:56:10 +02:00
Joscha
91c33596da Load crawlers from config file 2021-04-30 16:22:14 +02:00
Joscha
e7a51decb0 Elaborate on transforms and implement changes 2021-04-29 20:24:18 +02:00