Commit Graph

39 Commits

Author SHA1 Message Date
Toorero
d6f38a61e1 Fixed minor spelling mistakes 2021-11-02 01:54:00 +01:00
lukasprobst
55ea304ff3 Disable interpolation of ConfigParser 2021-10-25 23:37:42 +02:00
I-Al-Istannen
6673077397 Add kit-ipd crawler 2021-10-21 13:20:21 +02:00
Joscha
70b33ecfd9 Add migration notes to changelog
Also clean up some other formatting for consistency
2021-06-13 15:06:50 +02:00
Joscha
a292c4c437 Add example for ">>" arrow heads 2021-06-12 14:57:29 +02:00
Joscha
f28bbe6b0c Update transform rule documentation
It's still missing an example that uses rules with ">>" arrows.
2021-06-09 22:45:52 +02:00
Joscha
df3ad3d890 Add 'skip' option to crawlers 2021-06-04 18:47:13 +02:00
Joscha
1fc8e9eb7a Document credential file authenticator config options 2021-06-01 10:01:14 +00:00
Joscha
9d5ec84b91 Add credential file authenticator 2021-05-31 18:33:34 +02:00
Joscha
6fa9cfd4c3 Fix error when capturing group is None 2021-05-27 15:41:00 +02:00
Joscha
2c72a9112c Reword -name-> and -name-re-> docs and remove -name-exact-> 2021-05-27 13:20:37 +02:00
Joscha
17207546e9 Document --debug-transforms 2021-05-26 11:47:51 +02:00
Joscha
c665c36d88 Update README, CHANGELOG 2021-05-25 17:18:31 +02:00
Joscha
61430c8739 Overhaul config and CLI option names 2021-05-25 14:23:38 +02:00
Joscha
bce3dc384d Deduplicate path names in crawler
Also rename files so they follow the restrictions for windows file names if
we're on windows.
2021-05-25 12:11:15 +02:00
Joscha
c687d4a51a Implement cookie sharing 2021-05-24 13:10:44 +02:00
I-Al-Istannen
3ab3581f84 Add timeout for HTTP connection 2021-05-23 23:41:05 +02:00
Joscha
be4b1040f8 Document status and report options 2021-05-23 22:51:42 +02:00
I-Al-Istannen
6e9f8fd391 Add a keyring authenticator 2021-05-23 19:44:12 +02:00
I-Al-Istannen
ecdedfa1cf Add no-videos flag to ILIAS crawler 2021-05-23 12:37:01 +02:00
Joscha
0d10752b5a Configure explain log level via cli and config file 2021-05-19 17:50:10 +02:00
I-Al-Istannen
db1219d4a9 Create a link file in ILIAS crawler
This allows us to crawl links and represent them in the file system.
Users can choose between an ILIAS-imitation (that optionally
auto-redirects) and a plain text variant.
2021-05-17 21:44:54 +02:00
I-Al-Istannen
467ea3a37e Document ILIAS-Crawler arguments in CONFIG.md 2021-05-16 13:26:58 +02:00
Joscha
e1104f888d Add tfa authenticator 2021-05-15 18:27:16 +02:00
Joscha
8c32da7f19 Let authenticators provide username and password separately 2021-05-15 18:27:03 +02:00
Joscha
b70b62cef5 Make crawler sections start with "crawl:"
Also, use only the part of the section name after the "crawl:" as the crawler's
output directory. Now, the implementation matches the documentation again
2021-05-15 17:24:37 +02:00
Joscha
868f486922 Rename local crawler path to target 2021-05-15 17:12:25 +02:00
Joscha
a6fdf05ee9 Allow variable whitespace in arrow rules 2021-05-15 15:25:05 +02:00
Joscha
f897d7c2e1 Add name variants for all arrows 2021-05-15 15:25:05 +02:00
Joscha
302b8c0c34 Fix errors loading local crawler config
Apparently getint and getfloat may return a None even though this is not
mentioned in their type annotations.
2021-05-15 15:25:05 +02:00
Joscha
acd674f0a0 Change limiter logic
Now download tasks are a subset of all tasks.
2021-05-15 15:25:05 +02:00
Joscha
296a169dd3 Make limiter logic more complex
The limiter can now distinguish between crawl and download actions and has a
fancy slot system and delay logic.
2021-05-15 15:25:05 +02:00
Joscha
1591cb9197 Add options to slow down local crawler
These options are meant to make the local crawler behave more like a
network-based crawler for purposes of testing and debugging other parts of the
code base.
2021-05-15 15:25:01 +02:00
Joscha
961f40f9a1 Document simple authenticator 2021-05-13 19:55:04 +02:00
Joscha
f9b2fd60e2 Document local crawler and auth 2021-05-09 01:33:47 +02:00
Joscha
fde811ae5a Document on_conflict option 2021-05-05 12:24:35 +02:00
Joscha
a8dcf941b9 Document possible redownload settings 2021-04-30 15:32:56 +02:00
Joscha
e7a51decb0 Elaborate on transforms and implement changes 2021-04-29 20:24:18 +02:00
Joscha
9ec19be113 Document config file format 2021-04-29 20:24:18 +02:00