Basic Skills

Python

Official Python documentation

JavaScript

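JavaScript documentation

The Modern JavaScript Tutorial

Based on the latest JavaScript standard. Covers JavaScript from the basics to advanced topics in simple but sufficiently detailed explanations.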

Java

Java documentation

C/C++

C/C++ documentation

Node.js

Node.js documentation

Go

Go documentation


Crawling Skills

urllib

Python urllib documentation
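
As a minimal sketch of what the standard-library urllib package offers, the snippet below builds a query URL and a request object without sending anything over the network. The host example.com, the /search path, the User-Agent string, and the helper name build_search_request are illustrative assumptions, not part of any listed project.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request


def build_search_request(base_url: str, keyword: str, page: int = 1) -> Request:
    """Build (but do not send) a GET request for a hypothetical search endpoint."""
    # urlencode percent-escapes the query parameters.
    query = urlencode({"q": keyword, "page": page})
    # urljoin resolves the absolute path against the base URL.
    url = urljoin(base_url, "/search") + "?" + query
    # Setting a User-Agent header is a common first step when crawling.
    return Request(url, headers={"User-Agent": "demo-crawler/0.1"})


req = build_search_request("https://example.com/index.html", "web scraping", page=2)
print(req.full_url)
```

Passing req to urllib.request.urlopen would actually send the request; that step is left out here so the sketch runs offline.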

urllib3

urllib3 is a powerful, sanity-friendly HTTP client for Python. Much of the Python ecosystem already uses urllib3 and you should too. urllib3 brings many critical features that are missing from the Python standard libraries.

httplib2

The httplib2 module is a comprehensive HTTP client library that handles caching, keep-alive, compression, redirects and many kinds of authentication.

Requests

Python Requests documentation

aiohttp

Asynchronous HTTP Client/Server for asyncio and Python.

PySpider

Official documentation for the PySpider crawler framework

Scrapy

Official documentation for the Scrapy crawler framework

requests-html

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

pyppeteer

Unofficial Python port of puppeteer JavaScript (headless) chrome/chromium browser automation library.

selenium

Selenium is an umbrella project for a range of tools and libraries that support the automation of web browsers.

splash

Splash is a JavaScript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and Qt 5. The (Twisted) Qt reactor is used to make the service fully asynchronous, allowing it to take advantage of WebKit concurrency via the Qt main loop.

js2py

A JavaScript-to-Python translator and interpreter. Everything is done in 100% pure Python, so it's extremely easy to install and use. Supports Python 2 & 3. Full support for ECMAScript 5.1; ECMAScript 6 support is still experimental.

pyexecjs

Run JavaScript code from Python.

asyncio

asyncio is a library for writing concurrent code using the async/await syntax.
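
To illustrate the async/await style, here is a small sketch in which two coroutines run concurrently. The fetch coroutine is a hypothetical stand-in that sleeps instead of performing real network I/O, and the page names are made up.

```python
import asyncio


async def fetch(name: str, delay: float) -> str:
    """Stand-in for a network call: sleeps instead of doing real I/O."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def crawl_all() -> list:
    # gather runs the coroutines concurrently and returns their
    # results in the order the coroutines were passed in.
    return await asyncio.gather(
        fetch("page-a", 0.02),
        fetch("page-b", 0.01),
    )


results = asyncio.run(crawl_all())
print(results)
```

Although page-b finishes first, gather keeps the results in argument order, which makes it easy to match responses back to the requests that produced them.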

gevent

gevent is a coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of the libev or libuv event loop.

Tornado

Tornado is a Python web framework and asynchronous networking library, originally developed at FriendFeed.

Twisted

Twisted is an event-driven networking engine written in Python.


Parsing Skills

re

Official Python regular expression documentation
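
A minimal sketch of regex-based extraction with the standard re module follows. The HTML snippet, the pattern, and the helper name extract_links are illustrative assumptions; a pattern this simple only handles anchors whose sole attribute is a double-quoted href.

```python
import re

HTML = '<a href="/page/1">First</a> <a href="https://example.com/2">Second</a>'

# Named groups capture the href value and the (non-greedy) link text.
LINK_RE = re.compile(r'<a\s+href="(?P<url>[^"]+)">(?P<text>.*?)</a>')


def extract_links(html: str) -> list:
    """Return (url, text) pairs for every simple anchor tag in html."""
    return [(m.group("url"), m.group("text")) for m in LINK_RE.finditer(html)]


links = extract_links(HTML)
print(links)
```

Regular expressions are handy for small, predictable fragments; for full pages, a real parser such as the ones listed in this section is usually the more robust choice.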

lxml

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt.

BeautifulSoup4

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

cssselect2

cssselect2 is a straightforward implementation of CSS3 Selectors for markup documents (HTML, XML, etc.) that can be read by ElementTree-like parsers (including cElementTree, lxml, html5lib, etc.).

html5lib

html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

pyquery

pyquery allows you to make jQuery queries on XML documents. The API is as similar to jQuery as possible. pyquery uses lxml for fast XML and HTML manipulation.

feedparser

Universal Feed Parser is a Python module for downloading and parsing syndicated feeds.

goose3

goose3

newspaper

Article scraping & curation

ocrmypdf

OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF files, allowing them to be searched.

pdfminer.six

Pdfminer.six is a python package for extracting information from PDF documents.

pydub

Manipulate audio with a simple and easy high level interface

pyyaml

PyYAML is a YAML parser and emitter for Python.

readability

Measure the readability of a given text using surface characteristics

scrapely

A pure-python HTML screen-scraping library

untangle

untangle is a tiny Python library which converts an XML document to a Python object.

xml2dict

Converts an XML file to a native Python dict object.


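Cleaning Skills

NumPy

NumPy scientific computing: official Chinese documentation

Pandas

Pandas structured data analysis: official Chinese documentation

jieba

Jieba Chinese word segmentation

Matplotlib

Matplotlib 2D plotting library: official Chinese documentation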

gensim

Gensim is a free Python library for topic modelling, document indexing, and similarity retrieval with large corpora.

nameparser

A simple Python module for parsing human names into their individual components.

nltk

NLTK is a leading platform for building Python programs to work with human language data.

phonenumbers

Python port of Google's libphonenumber

PyNLPIR

PyNLPIR is a Python wrapper around the NLPIR/ICTCLAS Chinese segmentation software.

snownlp

SnowNLP is a library written in Python that makes it easy to process Chinese text.

thulac

An Efficient Lexical Analyzer for Chinese

xpinyin

Translates Chinese hanzi to pinyin in Python, inspired by flyerhzm’s chinese_pinyin gem.


Storage Skills

MongoDB

MongoDB API documentation

pymongo

PyMongo is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python.

Redis

Redis API documentation

redis-py

The Python interface to the Redis key-value store.

MySQL

MySQL documentation

pymssql

A simple database interface for Python that builds on top of FreeTDS to provide a Python DB-API (PEP-249) interface to Microsoft SQL Server.

pymysql

Python MySQL client

cx_Oracle

cx_Oracle is a Python extension module that enables access to Oracle Database.

elasticsearch

Python Elasticsearch Client

json

JSON (JavaScript Object Notation), specified by RFC 7159 (which obsoletes RFC 4627) and by ECMA-404, is a lightweight data interchange format inspired by JavaScript object literal syntax.
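
As a small sketch of how the standard json module serializes a scraped record and reads it back, consider the round trip below. The record contents are made-up sample data.

```python
import json

# Made-up sample record, as a crawler might produce it.
record = {"title": "Café", "tags": ["crawler", "demo"], "price": 9.5}

# ensure_ascii=False keeps non-ASCII text readable in the output;
# sort_keys=True makes the key order stable across runs.
encoded = json.dumps(record, ensure_ascii=False, sort_keys=True)

# loads parses the JSON text back into Python objects.
decoded = json.loads(encoded)
print(encoded)
```

The decoded value compares equal to the original dict, which is what makes JSON a convenient intermediate format before records are written to one of the stores listed in this section.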

mistune

A fast yet powerful Python Markdown parser with renderers and plugins, compatible with sane CommonMark rules.

psycopg2

Python adapter for PostgreSQL

py2neo

Py2neo is a client library and toolkit for working with Neo4j from within Python applications and from the command line.

pyodbc

Python ODBC bridge

pypdf2

A Pure-Python library built as a PDF toolkit.

thrift

The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.

xlrd

This package is for reading data and formatting information from older Excel files

xlwt

xlwt is a library for writing data and formatting information to older Excel files (i.e. .xls).


Anti-Crawling Tools

AST explorer

AST explorer

JavaScript AST visualizer

JavaScript AST visualizer

js code to svg flowchart

js-code-to-svg-flowchart

Alibaba Duguang

An online image OCR application from Alibaba.


Acceleration Skills

scrapy-redis

Redis-based components for Scrapy.

kafka

Python client for the Apache Kafka distributed stream processing system. kafka-python is designed to function much like the official java client, with a sprinkling of pythonic interfaces (e.g., consumer iterators).

celery

Celery is a simple, flexible, and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system.

multiprocessing

multiprocessing is a package that supports spawning processes using an API similar to the threading module.

subprocess

The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.
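
A minimal sketch of spawning a child process and collecting its output and return code is shown below. The helper name run_python_snippet and the 30-second timeout are illustrative assumptions; the child is simply another Python interpreter so the example stays portable.

```python
import subprocess
import sys


def run_python_snippet(code: str) -> subprocess.CompletedProcess:
    """Run a short Python snippet in a child interpreter and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # collect stdout and stderr through pipes
        text=True,            # decode the captured bytes to str
        timeout=30,           # arbitrary safety limit for this sketch
    )


result = run_python_snippet("print('hello from child')")
print(result.returncode, result.stdout.strip())
```

Passing the command as a list rather than a shell string avoids shell quoting problems, which matters when arguments come from scraped data.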

threading

This module constructs higher-level threading interfaces on top of the lower level _thread module. See also the queue module.
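
The pairing of threading with the queue module can be sketched as a small worker pool. The worker body only records a string instead of fetching anything, and the names worker and run_workers, the paths, and the thread count are illustrative assumptions.

```python
import queue
import threading


def worker(tasks: queue.Queue, results: list, lock: threading.Lock) -> None:
    """Take paths off the queue until it is empty and record a fake result."""
    while True:
        try:
            path = tasks.get_nowait()
        except queue.Empty:
            return  # nothing left to do
        with lock:  # guard the shared results list
            results.append(f"fetched {path}")


def run_workers(paths: list, n_threads: int = 3) -> list:
    """Process all paths with a fixed number of threads."""
    tasks = queue.Queue()
    for path in paths:
        tasks.put(path)
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker to finish
    return results


fetched = run_workers(["/a", "/b", "/c", "/d"])
print(sorted(fetched))
```

The order in which threads append results is not deterministic, so the output is sorted before printing.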

fork

Doing subprocess in Python should be easy

huey

A little task queue for Python: a lightweight alternative.

rabbitmq

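RabbitMQ is open-source message broker software (also called message-oriented middleware) that implements the Advanced Message Queuing Protocol (AMQP).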

rq (Redis Queue)

RQ (Redis Queue) is a simple Python library for queueing jobs and processing them in the background with workers.


Deployment Skills

docker

Learn how Docker helps developers bring their ideas to life by conquering the complexity of app development.

kubernetes

Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications.

openshift

Red Hat OpenShift is an open source container application platform based on the Kubernetes container orchestrator for enterprise app development and deployment.

python-scrapyd-api

python-scrapyd-api is a very simple Python wrapper for working with Scrapyd's API; it allows a Python application to talk to, and therefore control, the Scrapy Daemon.

scrapyd

Scrapyd is an application for deploying and running Scrapy spiders. It enables you to deploy (upload) your projects and control their spiders using a JSON API.

scrapyd-client

Scrapyd-client is a client for scrapyd.

scrapydweb

A web application for Scrapyd cluster management, with support for Scrapy log analysis and visualization.


Crawling Tools

anyproxy

AnyProxy is an open HTTP proxy server.

Appium

Mobile App Automation Made Awesome.

Charles

Charles is an HTTP proxy / HTTP monitor / Reverse Proxy that enables a developer to view all of the HTTP and SSL / HTTPS traffic between their machine and the Internet.

Google Chrome

Google Chrome web browser

Microsoft Edge

Microsoft Edge web browser

Fiddler

Fiddler is a free web debugging tool which logs all HTTP(S) traffic between your computer and the Internet. Inspect traffic, set breakpoints, and fiddle with incoming or outgoing data.

mitmproxy

mitmproxy is a free and open source interactive HTTPS proxy.

wireshark

Wireshark is a network packet analyzer. A network packet analyzer presents captured packet data in as much detail as possible.


Browser Extensions

EditThisCookie

EditThisCookie is a cookie manager. You can add, delete, edit, search, protect and block cookies!

Tampermonkey

Tampermonkey is the most popular userscript manager, with over 10 million weekly users. It's available for Microsoft Edge, Chrome, Safari, Opera Next, and Firefox.

ReRes

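ReRes can be used to modify the response content of page requests. By specifying rules, you can map requests to other URLs, or to local files or directories. ReRes supports both single-URL mapping and directory mapping.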

XPath Helper

Extract, edit, and evaluate XPath queries with ease.

Proxy SwitchyOmega

Manage and switch between multiple proxy settings quickly and easily.

JSON Formatter

Makes JSON easy to read. Open source.