Commit Graph

414 Commits

Author SHA1 Message Date
d63494908d Properly invalidate exceptions
The simple authenticator now properly invalidates its credentials. Also, the
invalidation functions have been given better names and documentation.
2021-05-15 17:37:05 +02:00
b70b62cef5 Make crawler sections start with "crawl:"
Also, use only the part of the section name after the "crawl:" as the crawler's
output directory. Now, the implementation matches the documentation again.
2021-05-15 17:24:37 +02:00
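
To illustrate the section naming described in b70b62cef5 and 595de88d96, here is a minimal sketch using Python's standard configparser. It is not taken from the repository: the function name crawler_output_dirs, the sample config text, and the "type" option are assumptions made for the example. The full section name (including the "crawl:" prefix) is kept as the crawler's name, and only the part after the prefix becomes the output directory.

```python
import configparser
from pathlib import PurePath
from typing import Dict

# Hypothetical config; section names and the "type" option are illustrative only.
CONFIG_TEXT = """
[crawl:lectures]
type = local

[auth:ilias]
type = simple
"""

CRAWL_PREFIX = "crawl:"


def crawler_output_dirs(text: str) -> Dict[str, PurePath]:
    """Map each crawler section name to its output directory.

    The key is the full section name (prefix included); the value is the
    part of the name after the "crawl:" prefix.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    result: Dict[str, PurePath] = {}
    for section in parser.sections():
        if section.startswith(CRAWL_PREFIX):
            result[section] = PurePath(section[len(CRAWL_PREFIX):])
    return result
```

Sections without the prefix, such as the "auth:" section above, are simply skipped by this sketch.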
868f486922 Rename local crawler path to target 2021-05-15 17:12:25 +02:00
b2a2b5999b Implement ILIAS auth and crawl home page
This commit introduces the necessary machinery to authenticate with
ILIAS and crawl the home page.

It can't do much yet and just silently fetches the homepage.
2021-05-15 15:25:05 +02:00
595de88d96 Fix authenticator and crawler names
Now, the "auth:" and "crawl:" parts are considered part of the name. This fixes
crawlers not being able to find their authenticators.
2021-05-15 15:25:05 +02:00
a6fdf05ee9 Allow variable whitespace in arrow rules 2021-05-15 15:25:05 +02:00
f897d7c2e1 Add name variants for all arrows 2021-05-15 15:25:05 +02:00
b0f731bf84 Make crawlers use transformers 2021-05-15 15:25:05 +02:00
302b8c0c34 Fix errors loading local crawler config
Apparently getint and getfloat may return a None even though this is not
mentioned in their type annotations.
2021-05-15 15:25:05 +02:00
acd674f0a0 Change limiter logic
Now download tasks are a subset of all tasks.
2021-05-15 15:25:05 +02:00
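
One way to read "download tasks are a subset of all tasks" (acd674f0a0) is that a download occupies a general task slot and, in addition, a download slot. The following sketch expresses that idea with two asyncio semaphores. The class layout and the method names limit_crawl and limit_download are assumptions; the actual limiter in the repository may be structured differently.

```python
import asyncio
from contextlib import asynccontextmanager


class Limiter:
    """Illustrative limiter: downloads count against both limits."""

    def __init__(self, task_limit: int, download_limit: int) -> None:
        self._tasks = asyncio.Semaphore(task_limit)
        self._downloads = asyncio.Semaphore(download_limit)

    @asynccontextmanager
    async def limit_crawl(self):
        # A crawl action only needs a general task slot.
        async with self._tasks:
            yield

    @asynccontextmanager
    async def limit_download(self):
        # A download needs a task slot and a download slot.
        async with self._tasks:
            async with self._downloads:
                yield


async def peak_concurrent_downloads(task_limit: int, download_limit: int, jobs: int) -> int:
    """Run `jobs` dummy downloads and report the highest concurrency seen."""
    limiter = Limiter(task_limit, download_limit)
    active = 0
    peak = 0

    async def download() -> None:
        nonlocal active, peak
        async with limiter.limit_download():
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)  # yield to other tasks
            active -= 1

    await asyncio.gather(*(download() for _ in range(jobs)))
    return peak
```

The limiter is created inside the running coroutine so that its semaphores belong to the active event loop on older Python versions as well.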
b0f9e1e8b4 Add vscode directory to gitignore 2021-05-15 15:25:05 +02:00
ed2e19a150 Add reasons for invalid values 2021-05-15 15:25:05 +02:00
296a169dd3 Make limiter logic more complex
The limiter can now distinguish between crawl and download actions and has a
fancy slot system and delay logic.
2021-05-15 15:25:05 +02:00
1591cb9197 Add options to slow down local crawler
These options are meant to make the local crawler behave more like a
network-based crawler for purposes of testing and debugging other parts of the
code base.
2021-05-15 15:25:01 +02:00
0c9167512c Fix output dir
I missed these while renaming the resolve function. Shame on me for not running
mypy earlier.
2021-05-14 21:28:38 +02:00
a673ab0fae Delete old files
I should've done this earlier
2021-05-14 21:27:44 +02:00
6e5fdf4e9e Set user agent to "pferd/<version>" 2021-05-14 21:27:44 +02:00
93a5a94dab Single-source version number 2021-05-14 21:27:44 +02:00
d565df27b3 Add HttpCrawler 2021-05-13 22:28:14 +02:00
961f40f9a1 Document simple authenticator 2021-05-13 19:55:04 +02:00
e3ee4e515d Disable highlighting of primitives
This commit prevents rich from highlighting python-looking syntax like numbers,
arrays, 'None' etc.
2021-05-13 19:47:44 +02:00
94d6a01cca Use file mtime in local crawler 2021-05-13 19:42:40 +02:00
38bb66a776 Update file metadata in more cases
PFERD now not only updates file metadata when a file is successfully added or
changed, but also when a file is downloaded and then detected to be unchanged.

This could occur for example if a remote file's modification time was bumped,
possibly because somebody touched the file without changing it.
2021-05-13 19:40:10 +02:00
68781a88ab Fix asynchronous methods not being awaited 2021-05-13 19:39:49 +02:00
910462bb72 Log stuff happening to files 2021-05-13 19:37:27 +02:00
6bd6adb977 Fix tmp file names 2021-05-13 19:36:46 +02:00
0acdee15a0 Let crawlers obtain authenticators 2021-05-13 18:57:20 +02:00
c3ce6bb31c Fix crawler cleanup not being awaited 2021-05-11 00:28:45 +02:00
0459ed093e Add simple authenticator
... including some required authenticator infrastructure
2021-05-11 00:28:03 +02:00
d5f29f01c5 Use global conductor instance
The switch from crawler-local conductors to a single pferd-global conductor was
made to prepare for auth section credential providers.
2021-05-11 00:05:04 +02:00
595ba8b7ab Remove dummy crawler 2021-05-10 23:47:46 +02:00
cec0a8e1fc Fix mypy errors 2021-05-09 01:45:01 +02:00
f9b2fd60e2 Document local crawler and auth 2021-05-09 01:33:47 +02:00
60cd9873bc Add local file crawler 2021-05-06 01:02:40 +02:00
273d56c39a Properly load crawler config 2021-05-05 23:45:10 +02:00
5497dd2827 Add @noncritical and @repeat decorators 2021-05-05 23:36:54 +02:00
bbfdadc463 Implement output directory 2021-05-05 18:08:34 +02:00
fde811ae5a Document on_conflict option 2021-05-05 12:24:35 +02:00
07e831218e Add sync report 2021-05-02 00:56:10 +02:00
91c33596da Load crawlers from config file 2021-04-30 16:22:14 +02:00
a8dcf941b9 Document possible redownload settings 2021-04-30 15:32:56 +02:00
e7a51decb0 Elaborate on transforms and implement changes 2021-04-29 20:24:18 +02:00
9ec19be113 Document config file format 2021-04-29 20:24:18 +02:00
f776186480 Use PurePath instead of Path
Path should only be used when we need to access the file system. For all other
purposes (mainly crawling), we use PurePath instead since the paths don't
correspond to paths in the local file system.
2021-04-29 20:20:25 +02:00
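
The distinction drawn in f776186480 can be sketched as follows: crawled paths are kept as PurePath values, which support path arithmetic but have no file system methods, and they only become a concrete Path when joined onto the local output directory. The function name to_local_path is an assumption for this example.

```python
from pathlib import Path, PurePath


def to_local_path(output_dir: Path, crawled: PurePath) -> Path:
    """Join a crawled (purely logical) path onto the local output directory.

    Only the result is a concrete Path that may be used for file system
    access; the crawled path itself never touches the disk.
    """
    return output_dir / crawled


# A crawled path supports pure operations such as splitting into parts...
crawled_path = PurePath("Lecture", "Slides", "01.pdf")
# ...but offers no file system methods like exists() or open().
local_path = to_local_path(Path("out"), crawled_path)
```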
0096d83387 Simplify Limiter implementation 2021-04-29 20:20:25 +02:00
20a24dbcbf Add changelog 2021-04-29 20:20:25 +02:00
502654d853 Fix mypy errors 2021-04-29 15:47:52 +02:00
d2103d7c44 Document crawler 2021-04-29 15:43:20 +02:00
d96a361325 Test and fix exclusive output 2021-04-29 15:27:16 +02:00
2e85d26b6b Use conductor via context manager 2021-04-29 14:23:28 +02:00