Python packages for Big Data | 2020 | NewbyCoder.com

Alternative Big Data libraries for Python

vaex

Github stargazers

8289

Github forks

589

Commits

3636

Code contributors

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

Created

Sept. 27, 2014

Updated

Sept. 27, 2024

License

mit

Github repo

Primary Language, based on Github DataLanguage

Python

Issues

533

Homepage

vaex.io

BigDL

Github stargazers

6643

Github forks

Commits

21025

Code contributors

104

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc.

Created

Aug. 29, 2016

Updated

Sept. 29, 2024

License

apache-2.0

Github repo

Primary Language, based on Github DataLanguage

Python

Issues

1274

root

Github stargazers

2686

Github forks

Commits

80008

Code contributors

414

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically

Created

June 27, 2013

Updated

Sept. 27, 2024

License

other

Github repo

Primary Language, based on Github DataLanguage

C++

Issues

811

Homepage

root.cern

data-structures-algorithms-python

Github stargazers

1221

Github forks

1506

Commits

Code contributors

This tutorial playlist covers data structures and algorithms in Python. Every tutorial covers the theory behind a data structure or an algorithm, Big O complexity analysis, and exercises that you can practice on.

Created

Sept. 29, 2020

Updated

Nov. 14, 2022

Github repo

Primary Language, based on Github DataLanguage

Jupyter

Issues

tuplex

Github stargazers

813

Github forks

Commits

Code contributors

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set.

Created

June 30, 2021

Updated

Dec. 22, 2023

License

apache-2.0

Github repo

Primary Language, based on Github DataLanguage

C++

Issues

Homepage

tuplex.cs.brown.edu

oio-sds

Github stargazers

662

Github forks

Commits

7256

Code contributors

High Performance Software-Defined Object Storage for Big Data and AI that supports Amazon S3 and OpenStack Swift

Created

March 13, 2015

Updated

Sept. 27, 2024

License

other

Github repo

Primary Language, based on Github DataLanguage

Python

Issues

Homepage

openio.io

eland

Github stargazers

644

Github forks

Commits

513

Code contributors

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

Created

June 11, 2019

Updated

Sept. 27, 2024

License

apache-2.0

Github repo

Type

Module/library

Primary Language, based on Github DataLanguage

Python

Issues

Homepage

eland.readthedocs.io

Herbie

Github stargazers

491

Github forks

Commits

993

Code contributors

Download numerical weather prediction datasets (HRRR, RAP, GFS, IFS, etc.) from NOMADS, NODD partners (Amazon, Google, Microsoft), ECMWF open data, and the University of Utah Pando Archive System.

Created

June 26, 2020

Updated

Aug. 30, 2024

License

mit

Github repo

Primary Language, based on Github DataLanguage

Python

Issues

Homepage

herbie.readthedocs.io

transbigdata

Github stargazers

464

Github forks

114

Commits

670

Code contributors

A Python package developed for transportation spatio-temporal big data processing, analysis and visualization.

Created

Oct. 21, 2021

Updated

Oct. 28, 2023

License

bsd-3-clause

Github repo

Primary Language, based on Github DataLanguage

Python

Issues

Homepage

transbigdata.readthedocs.io

python-bigquery-pandas

Github stargazers

447

Github forks

121

Commits

383

Code contributors

Google BigQuery connector for pandas

Created

Feb. 8, 2017

Updated

Sept. 23, 2024

License

bsd-3-clause

Github repo

Primary Language, based on Github DataLanguage

Python

Issues

Homepage

googleapis.dev

hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference

Github stargazers

406

Github forks

144

Commits

236

Code contributors

Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required

Created

May 8, 2019

Updated

Nov. 5, 2020

License

apache-2.0

Github repo

Type

Module/library

Primary Language, based on Github DataLanguage

Jupyter

Issues

arvados

Github stargazers

397

Github forks

116

Commits

30473

Code contributors

An open source platform for managing and analyzing biomedical big data

Created

April 11, 2013

Updated

Sept. 27, 2024

License

other

Github repo

Primary Language, based on Github DataLanguage

Issues

Homepage

arvados.org

lithops

Github stargazers

317

Github forks

105

Commits

3968

Code contributors

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

Created

April 23, 2018

Updated

Sept. 3, 2024

License

apache-2.0

Github repo

Primary Language, based on Github DataLanguage

Python

Issues

Homepage

lithops.cloud

PythonDataScienceFullThrottle

Github stargazers

258

Github forks

220

Commits

180

Code contributors

Downloads for my Safari Online Learning live training course Python Data Science Full Throttle: Introductory Artificial Intelligence (AI), Big Data and Cloud Case Studies

Created

July 18, 2019

Updated

Aug. 13, 2024

Github repo

Type

Resource

Primary Language, based on Github DataLanguage

Jupyter

Issues

gimel

Github stargazers

245

Github forks

Commits

167

Code contributors

Big Data Processing Framework - Unified Data API or SQL on Any Storage

Created

April 4, 2018

Updated

Nov. 8, 2022

License

apache-2.0

Github repo

Primary Language, based on Github DataLanguage

Scala

Issues

Homepage

gimel.io

bigdata-playground

Github stargazers

208

Github forks

Commits

465

Code contributors

A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, Node.js, Angular, GraphQL

Created

Dec. 12, 2017

Updated

Feb. 1, 2019

License

apache-2.0

Github repo

Type

App

Primary Language, based on Github DataLanguage

TypeScript

Issues

.config

Github stargazers

204

Github forks

Commits

Code contributors

# # Automatically generated file; DO NOT EDIT. # OpenWrt Configuration # CONFIG_MODULES=y CONFIG_HAVE_DOT_CONFIG=y # CONFIG_TARGET_sunxi is not set # CONFIG_TARGET_apm821xx is not set # CONFIG_TARGET_ath25 is not set CONFIG_TARGET_ar71xx=y # CONFIG_TARGET_ath79 is not set # CONFIG_TARGET_bcm27xx is not set # CONFIG_TARGET_bcm53xx is not set # CONFIG_TARGET_b

Created

June 23, 2020

Updated

June 23, 2020

License

mit

Github repo

Primary Language, based on Github DataLanguage

Shell

Issues

WallStreetBets_BigDataAnalysis

Github stargazers

173

Github forks

Commits

Code contributors

Research project that aims to classify the best stock research posts from r/WallStreetBets for you. 😏

Created

March 15, 2021

Updated

May 16, 2021

Github repo

Primary Language, based on Github DataLanguage

Jupyter

Issues

Homepage

wsbrecommender.web.app

ai-flow

Github stargazers

170

Github forks

Commits

869

Code contributors

AI Flow is an open source framework that bridges big data and artificial intelligence.

Created

Oct. 14, 2021

Updated

Oct. 9, 2022

License

apache-2.0

Github repo

Primary Language, based on Github DataLanguage

Python

Issues

deltacat

Github stargazers

148

Github forks

Commits

324

Code contributors

A portable Pythonic Data Catalog API powered by Ray that brings exabyte-level scalability and fast, ACID-compliant, change-data-capture to your big data workloads.

Created

Aug. 11, 2021

Updated

Sept. 23, 2024

License

apache-2.0

Github repo

Primary Language, based on Github DataLanguage

Python

Issues