- Konstanz, Germany
- https://de.linkedin.com/pub/sebastian-nagel/35/320/8b4
- tika Public
  Forked from apache/tika
  Mirror of Apache Tika
  Java · Apache License 2.0 · Updated Nov 14, 2024
- crawler-commons Public
  Forked from crawler-commons/crawler-commons
  A set of reusable Java components that implement functionality common to any web crawler
  Java · Apache License 2.0 · Updated Nov 12, 2024
- webarchive-commons Public
  Forked from iipc/webarchive-commons
  Common web archive utility code.
  Java · Apache License 2.0 · Updated Nov 9, 2024
- jwarc Public
  Forked from iipc/jwarc
  Java library for reading and writing WARC files with a typed API
  Java · Apache License 2.0 · Updated Nov 8, 2024
- nutch Public
  Forked from apache/nutch
  Mirror of Apache Nutch
- zip2gz Public
  Forked from patrikaxelsson/zip2gz
  Create a file tree with the raw data from a zip file in usable format
  Python · Updated Oct 16, 2024
- pga-declarations Public
  Forked from OpenTermsArchive/pga-declarations
  Declarations of terms of major social media platforms. Maintained by the Platform Governance Archive team, University of Bremen.
  JavaScript · GNU Affero General Public License v3.0 · Updated Jul 30, 2024
- selenium_test_demo_tu2txt Public
  Forked from suneecat/selenium_test_demo_tu2txt
  Python · Apache License 2.0 · Updated Jul 29, 2024
- warc-crawler Public
  Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr
- storm-crawler Public
  Forked from apache/incubator-stormcrawler
  Web crawler SDK based on Apache Storm
- cc2dataset Public
  Forked from rom1504/cc2dataset
  Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
  Python · MIT License · Updated Mar 1, 2023
- duckdb-web Public
  Forked from duckdb/duckdb-web
  DuckDB-Web - Source code of duckdb.org
  JavaScript · Updated Jan 30, 2023
- browsertrix-crawler Public
  Forked from webrecorder/browsertrix-crawler
  Run a high-fidelity browser-based crawler in a single Docker container
- wdc-page Public
  Forked from wbsg-uni-mannheim/wdc-page
  This repository contains the source files of the Web Data Commons website and is used to maintain the site. The Web Data Commons project extracts structured data from the Common Crawl
  HTML · Updated Nov 16, 2022
- sfm-facebook-harvester Public archive
  Forked from fgremler/sfm-facebook-harvester
  Python · Updated Nov 7, 2022
- Experiments and metrics about robots.txt captures, presentation at #ossym2022
  Jupyter Notebook · MIT License · Updated Oct 11, 2022
- twarc-csv Public
  Forked from DocNow/twarc-csv
  A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.
  Python · MIT License · Updated Jul 6, 2022
- sfm-twitter-harvester Public
  Forked from gwu-libraries/sfm-twitter-harvester
  A harvester for twitter content as part of Social Feed Manager.
  Python · MIT License · Updated Jul 5, 2022
- sfm-ui Public
  Forked from gwu-libraries/sfm-ui
  Social Feed Manager user interface application.
  Python · MIT License · Updated Jul 4, 2022
- sfm-utils Public
  Forked from gwu-libraries/sfm-utils
  Utilities to support Social Feed Manager
- sfm-instagram-harvester Public
  Forked from fgremler/sfm-instagram-harvester
  Python · Updated May 23, 2022
- news-please Public
  Forked from fhamborg/news-please
  news-please - an integrated web crawler and information extractor for news that just works.
- pywb Public
  Forked from webrecorder/pywb
  Python WayBack for web archive replay and url-rewriting HTTP/S web proxy
- data_tooling Public
  Forked from bigscience-workshop/data_tooling
  Tools for managing datasets for governance and training.
  HTML · Apache License 2.0 · Updated Dec 20, 2021
- sfm-docker Public
  Forked from gwu-libraries/sfm-docker
  Docker support for Social Feed Manager.
  Shell · MIT License · Updated Nov 2, 2021