

Overview Repositories 60 Projects 0 Packages 0 Stars 68


Sebastian Nagel sebastian-nagel


115 followers · 3 following

@commoncrawl
Konstanz, Germany
https://de.linkedin.com/pub/sebastian-nagel/35/320/8b4


tika Public
Forked from apache/tika

Mirror of Apache Tika

Java Apache License 2.0 Updated Nov 14, 2024
crawler-commons Public
Forked from crawler-commons/crawler-commons

A set of reusable Java components that implement functionality common to any web crawler

Java Apache License 2.0 Updated Nov 12, 2024
webarchive-commons Public
Forked from iipc/webarchive-commons

Common web archive utility code.

Java Apache License 2.0 Updated Nov 9, 2024
jwarc Public
Forked from iipc/jwarc

Java library for reading and writing WARC files with a typed API

Java Apache License 2.0 Updated Nov 8, 2024
nutch Public
Forked from apache/nutch

Mirror of Apache Nutch

Java 2 Apache License 2.0 Updated Oct 27, 2024
zip2gz Public
Forked from patrikaxelsson/zip2gz

Create a file tree with the raw data from a zip file in usable format

Python Updated Oct 16, 2024
pga-declarations Public
Forked from OpenTermsArchive/pga-declarations

Declarations of terms of major social media platforms. Maintained by the Platform Governance Archive team, University of Bremen.

JavaScript GNU Affero General Public License v3.0 Updated Jul 30, 2024
selenium_test_demo_tu2txt Public
Forked from suneecat/selenium_test_demo_tu2txt

Python Apache License 2.0 Updated Jul 29, 2024
nutch-test-single-node-cluster Public

Shell 4 1 Apache License 2.0 Updated Apr 11, 2024
warc-crawler Public

Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr

elasticsearch solr apache-storm warc web-archives warc-files stormcrawler

FLUX 8 1 Updated Nov 24, 2023
storm-crawler Public
Forked from apache/incubator-stormcrawler

Web crawler SDK based on Apache Storm

HTML 1 Apache License 2.0 Updated Oct 2, 2023
sitemap-performance-test Public

Java Apache License 2.0 Updated Jul 13, 2023
sfm-web-harvester-browsertrix Public

Python Updated May 30, 2023
cc2dataset Public
Forked from rom1504/cc2dataset

Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...

Python MIT License Updated Mar 1, 2023
duckdb-web Public
Forked from duckdb/duckdb-web

DuckDB-Web - Source code of duckdb.org

JavaScript Updated Jan 30, 2023
browsertrix-crawler Public
Forked from webrecorder/browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

JavaScript 1 GNU Affero General Public License v3.0 Updated Dec 5, 2022
suffix_cat Public
Forked from lukaskawerau/suffix_cat

Python Updated Nov 29, 2022
wdc-page Public
Forked from wbsg-uni-mannheim/wdc-page

This repository contains the source files of the Web Data Commons website and is used to maintain the site. The Web Data Commons project extracts structured data from the Common Crawl

HTML Updated Nov 16, 2022
sfm-facebook-harvester Public archive
Forked from fgremler/sfm-facebook-harvester

Python Updated Nov 7, 2022
ossym2022-robotstxt-experiments Public

Experiments and metrics about robots.txt captures, presentation at #ossym2022

Jupyter Notebook MIT License Updated Oct 11, 2022
twarc-csv Public
Forked from DocNow/twarc-csv

A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.

Python MIT License Updated Jul 6, 2022
sfm-twitter-harvester Public
Forked from gwu-libraries/sfm-twitter-harvester

A harvester for twitter content as part of Social Feed Manager.

Python MIT License Updated Jul 5, 2022
sfm-ui Public
Forked from gwu-libraries/sfm-ui

Social Feed Manager user interface application.

Python MIT License Updated Jul 4, 2022
sfm-utils Public
Forked from gwu-libraries/sfm-utils

Utilities to support Social Feed Manager

Python 1 MIT License Updated May 23, 2022
sfm-instagram-harvester Public
Forked from fgremler/sfm-instagram-harvester

Python Updated May 23, 2022
news-please Public
Forked from fhamborg/news-please

news-please - an integrated web crawler and information extractor for news that just works.

Python 1 1 Apache License 2.0 Updated May 7, 2022
introduction-to-python Public

Jupyter Notebook 2 Apache License 2.0 Updated Jan 30, 2022
pywb Public
Forked from webrecorder/pywb

Python WayBack for web archive replay and url-rewriting HTTP/S web proxy

Python 1 GNU General Public License v3.0 Updated Jan 10, 2022
data_tooling Public
Forked from bigscience-workshop/data_tooling

Tools for managing datasets for governance and training.

HTML Apache License 2.0 Updated Dec 20, 2021
sfm-docker Public
Forked from gwu-libraries/sfm-docker

Docker support for Social Feed Manager.

Shell MIT License Updated Nov 2, 2021


© 2024 GitHub, Inc.
