Data analysis: Python setup
Python 2
pip install ipython[notebook]
pip install numpy
pip install matplotlib
pip install seaborn
Python 3
pip3 install numpy
pip3 install matplotlib
pip3 install seaborn
pip3 install --upgrade pip
pip3 install jupyter
jupyter notebook  # start Jupyter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series,DataFrame
import csv
import json
from lxml.html import parse
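from urllib2 import urlopen  # Python 2; in Python 3: from urllib.request import urlopen
from lxml import objectify
from StringIO import StringIO  # Python 2; in Python 3: from io import StringIO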
import requests
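%matplotlib inline  # enable inline plotting so figures are displayed in the notebook
%pylab inline  # in Jupyter notebook: pylab mode with inline figures
ipython --pylab  # from the shell: start IPython in pylab mode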
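IPython basics
- Tab: auto-completion
- variable?: show general information about the variable
- function?: show the function's docstring
- function??: show the function's source code
- %run xxx.py
- %paste, %cpaste
- Ctrl+L: clear the screen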
- %debug, %pdb
- %hist
- %quickref
- %magic
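IPython debugging
- %debug: enter the debugger
- u (up), d (down): move between the levels of the stack trace
- %pdb: automatically invoke the debugger after an exception
- %run -d xxx.py: run the script under the debugger; s (step) steps into the script, b (break) sets a breakpoint, c (continue) runs the script until the next breakpoint, n (next) executes the next line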
IPython performance
- %time, %timeit
- %prun, %run -p
- %lprun: line-by-line profiling (provided by the line_profiler extension)
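The magics above only work inside IPython. As a rough sketch, similar measurements can be taken in a plain Python 3 script with the standard library's timeit and cProfile modules. The function sum_of_squares is a hypothetical workload made up for this illustration, and the repeat counts are arbitrary.

```python
import cProfile
import io
import pstats
import timeit


def sum_of_squares(n):
    """Hypothetical workload used only for this illustration."""
    return sum(i * i for i in range(n))


# Rough analogue of %timeit: run the call repeatedly and keep the best run.
calls_per_run = 20
timings = timeit.repeat(lambda: sum_of_squares(10000),
                        number=calls_per_run, repeat=3)
best_per_call = min(timings) / calls_per_run
print("best time per call: %.6f s" % best_per_call)

# Rough analogue of %prun: profile a single call and print the top entries
# sorted by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
result = sum_of_squares(10000)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Taking the minimum of several repeats is the usual way to reduce noise from other processes; the absolute numbers depend on the machine.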
Important Python libraries (for data analysis)
NumPy
- ndarray: N-dimensional array object
- ndarray.shape
- ndarray.dtype
- np.array()
- np.arange()
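A minimal sketch of the NumPy items listed above, assuming NumPy is installed. The sample values are arbitrary, and reshape is used only to show how shape changes.

```python
import numpy as np

# np.array() builds an ndarray from a (nested) Python sequence.
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)   # (2, 3): two rows, three columns
print(a.dtype)   # an integer dtype; the exact width depends on the platform

# np.arange() works like range() but returns an ndarray.
b = np.arange(6)
c = b.reshape(2, 3)   # same six values arranged as 2 rows x 3 columns

# The dtype is inferred from the input data.
f = np.array([1.5, 2.0])
print(f.dtype)   # float64
```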
pandas
- Series: similar to a one-dimensional array object / ordered dict
- Series.values
- Series.index
- DataFrame: tabular data; each of its columns is a Series
- DataFrame(dict_data / 2-D ndarray / Series, columns=['c1', 'c2', 'c3'], index=['r1', 'r2', 'r3'])
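The Series and DataFrame notes above can be sketched as follows, assuming pandas and NumPy are installed. The labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series is a 1-D array of values plus an index of labels.
s = pd.Series([4, 7, -5], index=["a", "b", "c"])
print(s.values)   # the underlying array of values
print(s.index)    # the labels
print(s["b"])     # label-based lookup, like a dict

# A DataFrame built from a dict of columns, with explicit row labels.
df = pd.DataFrame(
    {"c1": [1, 2, 3], "c2": [4.0, 5.0, 6.0]},
    columns=["c1", "c2"],
    index=["r1", "r2", "r3"],
)

# A DataFrame built from a 2-D ndarray.
df2 = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    columns=["c1", "c2"],
    index=["r1", "r2", "r3"],
)

# Selecting one column returns a Series.
col = df["c1"]
print(type(col))
```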
Data loading
- pd.read_csv()
- pd.read_table()
- chunksize: read a large file in chunks
- df.to_csv()
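A small sketch of the loading and saving calls above, assuming pandas is installed. To keep it self-contained it reads from an in-memory string instead of a file on disk; the column names and values are invented.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,score\nann,90\nbob,85\ncid,70\ndee,60\n"

# pd.read_csv() parses comma-separated text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# pd.read_table() is the same reader with tab as the default separator,
# so the separator is passed explicitly here.
df_t = pd.read_table(io.StringIO(csv_text), sep=",")

# With chunksize, read_csv returns an iterator of DataFrames,
# which keeps memory use bounded for large files.
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["score"].sum()

# df.to_csv() writes the frame back out; index=False omits the row labels.
out = io.StringIO()
df.to_csv(out, index=False)
print(out.getvalue())
```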
matplotlib
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
fig, axes=plt.subplots(2,3)
ax.plot(x, y, 'go--')  # g = green, o = marker, -- = line style
ax.set_xticks,
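The matplotlib calls above can be put together as in the sketch below, assuming matplotlib is installed. The data points and tick positions are arbitrary, and the Agg backend is chosen only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

# One figure with a single subplot (1 row, 1 column, first cell).
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

# 'go--' = green colour, circle markers, dashed line.
line, = ax.plot(x, y, "go--")

# Place the x-axis ticks at explicit positions.
ax.set_xticks([0, 1, 2, 3])

# A 2 x 3 grid of subplots; axes is a 2-D array of Axes objects.
fig2, axes = plt.subplots(2, 3)
axes[0, 1].plot(x, y, "go--")
```

In a notebook with %matplotlib inline the figures appear below the cell, so the backend line can be dropped there.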