Plotting distributions, densities and similar figures in Python (frequency distribution and density plots)

The simplest hist (histogram)

The simplest hist takes a single column of data (a Series) as input, with no other parameters to worry about.

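import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from numpy.random import randn
from scipy import stats

data = randn(75)
plt.hist(data)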
(array([  2.,   5.,   4.,  10.,  12.,  16.,   7.,   7.,   6.,   6.]),
 array([-2.04713616, -1.64185099, -1.23656582, -0.83128065, -0.42599548,
        -0.02071031,  0.38457486,  0.78986003,  1.1951452 ,  1.60043037,
         2.00571554]),
 <a list of 10 Patch objects>)
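The tuple printed above is what plt.hist returns: the count in each bin, the bin edges, and the drawn patches. The first two can be reproduced without drawing anything. The following is only a sketch: it assumes a seeded generator in place of the unseeded randn(75), so the numbers will not match the output above.

```python
import numpy as np

# Assumption: a seeded generator stands in for the unseeded randn(75) above,
# so the values differ from the printed output.
rng = np.random.default_rng(0)
data = rng.standard_normal(75)

# np.histogram defaults to 10 equal-width bins, like plt.hist
counts, edges = np.histogram(data, bins=10)

print(counts)        # number of observations falling in each bin
print(edges)         # 11 edges delimit 10 bins
print(counts.sum())  # every observation lands in exactly one bin
```

The first edge is the sample minimum and the last edge is the sample maximum, which is why the printed edges above run from about -2.05 to 2.01.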

# Adding a few parameters gives the plot a different look
data = randn(100)
plt.hist(data, bins=12, color=sns.desaturate("indianred", .8), alpha=.4)
(array([  2.,   3.,   3.,  11.,  10.,  15.,  10.,  17.,  10.,   8.,   7.,
          4.]),
 array([-2.56765228, -2.1665249 , -1.76539753, -1.36427015, -0.96314278,
        -0.5620154 , -0.16088803,  0.24023935,  0.64136672,  1.0424941 ,
         1.44362147,  1.84474885,  2.24587623]),
 <a list of 12 Patch objects>)

# The data above come from a single population; next, a hist for two populations
data1 = stats.poisson(2).rvs(100)
data2 = stats.poisson(5).rvs(500)
max_data = np.r_[data1, data2].max()
bins = np.linspace(0, max_data, max_data+1)
#plt.hist(data1) # 
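# First draw the two histograms in separate figures
# (the original used normed=True; current matplotlib calls this parameter density)
plt.hist(data1, bins, density=True, color="#FF0000", alpha=.9)
plt.figure()
plt.hist(data2, bins, density=True, color="#C1F320", alpha=.5)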
(array([ 0.006,  0.03 ,  0.082,  0.116,  0.17 ,  0.214,  0.152,  0.098,
         0.06 ,  0.046,  0.018,  0.008]),
 array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
         11.,  12.]),
 <a list of 12 Patch objects>)

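# The plot below shows what the normed parameter (density in current matplotlib) does:
# each sample still gets its own hist, and where the two overlap the alpha blending produces a third color.
plt.hist(data1, bins, density=True, color="#FF0000", alpha=.9)
plt.hist(data2, bins, density=True, color="#C1F320", alpha=.5)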
(array([ 0.006,  0.03 ,  0.082,  0.116,  0.17 ,  0.214,  0.152,  0.098,
         0.06 ,  0.046,  0.018,  0.008]),
 array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
         11.,  12.]),
 <a list of 12 Patch objects>)
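Why normalization matters here: data1 has 100 observations and data2 has 500, so raw counts would make the second histogram tower over the first. With density normalization each count is divided by (sample size × bin width), so the bar areas of each histogram sum to 1. A minimal sketch with np.histogram, assuming seeded Poisson samples in place of the scipy ones above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two samples of different sizes, mirroring data1 (100) and data2 (500)
small = rng.poisson(2, size=100)
large = rng.poisson(5, size=500)

max_value = max(small.max(), large.max())
bins = np.linspace(0, max_value, max_value + 1)
widths = np.diff(bins)

for sample in (small, large):
    counts, _ = np.histogram(sample, bins=bins)
    heights, _ = np.histogram(sample, bins=bins, density=True)
    # density=True divides each count by (sample size * bin width)
    manual = counts / (counts.sum() * widths)
    assert np.allclose(heights, manual)
    # tallest raw count differs a lot, but the bar areas always sum to 1
    print(len(sample), counts.max(), round(float((heights * widths).sum()), 6))
```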

# Other hist parameters
x = stats.gamma(3).rvs(5000)
#plt.hist(x, bins=80) # every bin is drawn with its boundary lines
# To make the plot look more continuous (no lines between bins), use the histtype parameter
plt.hist(x, bins=80, histtype="stepfilled", alpha=.8)
(array([  19.,   27.,   53.,   97.,  103.,  131.,  167.,  176.,  196.,
         215.,  214.,  202.,  197.,  153.,  202.,  214.,  181.,  160.,
         175.,  179.,  148.,  148.,  117.,  130.,  125.,  122.,  100.,
         102.,   80.,   85.,   66.,   67.,   58.,   51.,   56.,   42.,
          52.,   36.,   37.,   26.,   29.,   19.,   26.,   21.,   26.,
          19.,   16.,   12.,   12.,   17.,   12.,    9.,   10.,    4.,
           4.,    6.,    4.,    7.,    3.,    6.,    1.,    3.,    3.,
           1.,    1.,    2.,    0.,    0.,    1.,    2.,    3.,    1.,
           2.,    3.,    1.,    2.,    1.,    0.,    0.,    2.]),
 array([  0.13431232,   0.28186933,   0.42942633,   0.57698333,
          0.72454033,   0.87209734,   1.01965434,   1.16721134,
          1.31476834,   1.46232535,   1.60988235,   1.75743935,
          1.90499636,   2.05255336,   2.20011036,   2.34766736,
          2.49522437,   2.64278137,   2.79033837,   2.93789538,
          3.08545238,   3.23300938,   3.38056638,   3.52812339,
          3.67568039,   3.82323739,   3.9707944 ,   4.1183514 ,
          4.2659084 ,   4.4134654 ,   4.56102241,   4.70857941,
          4.85613641,   5.00369341,   5.15125042,   5.29880742,
          5.44636442,   5.59392143,   5.74147843,   5.88903543,
          6.03659243,   6.18414944,   6.33170644,   6.47926344,
          6.62682045,   6.77437745,   6.92193445,   7.06949145,
          7.21704846,   7.36460546,   7.51216246,   7.65971947,
          7.80727647,   7.95483347,   8.10239047,   8.24994748,
          8.39750448,   8.54506148,   8.69261849,   8.84017549,
          8.98773249,   9.13528949,   9.2828465 ,   9.4304035 ,
          9.5779605 ,   9.7255175 ,   9.87307451,  10.02063151,
         10.16818851,  10.31574552,  10.46330252,  10.61085952,
         10.75841652,  10.90597353,  11.05353053,  11.20108753,
         11.34864454,  11.49620154,  11.64375854,  11.79131554,  11.93887255]),
 <a list of 1 Patch objects>)

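# The multi-population hists above are still drawn separately; the two samples are not combined.
# jointplot draws the joint distribution, i.e. the distribution over the Cartesian product of the x and y populations.
# Note that jointplot needs two samples of equal size.
# jointplot is very practical: for two continuous variables it shows the distributions and the central tendency with little effort.
# For example:
x = stats.gamma(2).rvs(5000)
y = stats.gamma(50).rvs(5000)
with sns.axes_style("dark"):
    sns.jointplot(x=x, y=y, kind="hex")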

# Next, a demo with somewhat more realistic data
import pandas as pd
from pandas import read_csv
df = read_csv("test.csv", index_col='index')
df[:2]

	department	typecity	product	credit	ddate	month_repay	apply_amont	month_repay_real	amor	tst_amount	salary_net	LTI	DTI	pass	deny
index
13652622	gedai	ordi	elite	CR8	2015/5/29 12:27	2000	40000	1400.90	36	30000	1365.30	21.973193	0.610366	1	0
13680088	gedai	ordi	xinxin	CR16	2015/6/3 18:38	8000	100000	3589.01	36	70000	3598.66	19.451685	0.540325	1	0

clean_df = df[df['salary_net'] < 10000]
sub_df = pd.DataFrame(data=clean_df, columns=['salary_net', 'month_repay'] )
with sns.axes_style("dark"):
    sns.jointplot(x='salary_net', y='month_repay', data=sub_df, kind="hex")
    plt.ylim([0, 10000])
    plt.xlim([0, 10000])

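Note: in the seaborn version used here, jointplot not only draws the plot but also reports the correlation coefficient of x and y (pearson_r) and the p-value of the hypothesis test for r = 0.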
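The correlation coefficient mentioned in the note can be computed directly. The sketch below uses made-up data (the x and y arrays are hypothetical stand-ins for columns such as salary_net and month_repay) and numpy only. It stops at the test statistic; scipy.stats.pearsonr(x, y) returns r together with the two-sided p-value.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for two related columns
x = rng.gamma(2.0, size=500)
y = 0.5 * x + rng.normal(0.0, 1.0, size=500)

# Pearson r: covariance divided by the product of the standard deviations
r = float(np.corrcoef(x, y)[0, 1])
manual = float(np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std()))

# Test statistic for H0: r = 0; under H0 it follows a t distribution
# with n - 2 degrees of freedom
n = len(x)
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
print(round(r, 3), round(manual, 3), round(float(t_stat), 2))
```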

Next, two new plots: kdeplot and rugplot

# rugplot
# A rugplot is a more direct "histogram" than the histogram itself: it marks every single observation
data = randn(80)
plt.hist(data, alpha=0.3, color='#ffffff')
sns.rugplot(data)
xx = np.linspace(-4, 4, 100)
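# Compute the bandwidth
# Rule-of-thumb value (4 * sigma**5 / (3 * n)) ** (1/5); it is overwritten by the next assignment
bandwidth = ((4 * data.std() ** 5) / (3 * len(data))) ** .2
# The value actually used below: n ** (-1/5)
bandwidth = len(data) ** (-1. / 5)
# print(bandwidth) gives 0.416276603701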
kernels = []
for d in data:
    # basis function as a gaussian PDF
    kernel = stats.norm(d, bandwidth).pdf(xx)
    kernels.append(kernel)
    # Scale a copy for plotting, so the stored kernel stays an unscaled PDF
    scaled = kernel / kernel.max() * .4
    plt.plot(xx, scaled, "#888888", alpha=.18)
plt.ylim(0, 1)
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
# color_palette is the "palette" used for drawing
c1, c2 = sns.color_palette("husl", 3)[:2]
# summed kde
summed_kde = np.sum(kernels, axis=0)
ax1.plot(xx, summed_kde, c=c1)
sns.rugplot(data, color=c1, ax=ax1)
ax1.set_title("summed basis function")
# density estimate
scipy_kde = stats.gaussian_kde(data)(xx)
ax2.plot(xx, scipy_kde, c=c2)
sns.rugplot(data, color=c2, ax=ax2)
ax2.set_yticks([])  # hide the y ticks
ax2.set_title("scipy gaussian_kde")
f.tight_layout()
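The two panels above have the same shape because a Gaussian KDE is just the average of one Gaussian bump per observation. The sketch below redoes that with numpy only. It assumes a seeded sample and, as scipy's gaussian_kde does by default, scales Scott's factor n ** (-1/5) by the sample standard deviation; gaussian_pdf is a helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(80)
xx = np.linspace(-6, 6, 600)

# Scott's factor n ** (-1/5); for n = 80 this is the 0.4163 printed earlier
scott_factor = len(data) ** (-1.0 / 5)
# Assumption: multiply the factor by the sample standard deviation (ddof=1)
bandwidth = scott_factor * data.std(ddof=1)

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# One kernel per observation, shape (80, 600)
kernels = gaussian_pdf(xx[None, :], data[:, None], bandwidth)
# Averaging (rather than summing) gives a curve that integrates to 1
density = kernels.mean(axis=0)

# Riemann-sum approximation of the area under the curve
area = float(np.sum(density) * (xx[1] - xx[0]))
print(round(scott_factor, 6), round(area, 4))
```

Summing the kernels instead of averaging them, as the first panel does, only changes the vertical scale, not the shape.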

With the above in mind, it is easy to see what kdeplot does.

sns.kdeplot(data, shade=True)
# Compare the effect of bw (bandwidth)
pal = sns.blend_palette([sns.desaturate("royalblue", 0), "royalblue"], 5)
bws = [.1, .25, .5, 1, 2]
for bw, c in zip(bws, pal):
    sns.kdeplot(data, bw=bw, color=c, lw=1.8, label=bw)
plt.legend(title="kernel bandwidth value")
sns.rugplot(data, color="#CF3512")
<matplotlib.legend.Legend at 0x225db278> 
# The cut and clip parameters control how far the estimate extends beyond the data (left of the minimum, right of the maximum)
with sns.color_palette('Set2'):
    f, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 8), sharex=True)
    for cut in [4, 3, 2]:
        sns.kdeplot(data, cut=cut, label=cut, lw=cut*1.5, ax=ax1)
    for clip in [1, 2, 3]:
        sns.kdeplot(data, clip=(-clip, clip), label=clip, ax=ax2)
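As far as I understand the two parameters, cut stretches the evaluation grid by cut × bandwidth past the smallest and largest observation, while clip is a hard limit beyond which the curve is not evaluated at all. The sketch below only computes the grid end points; kde_support is a hypothetical helper written for this illustration, not a public seaborn function, and the data and bandwidth are made up.

```python
import numpy as np

def kde_support(data, bw, gridsize=100, cut=3, clip=(-np.inf, np.inf)):
    """Hypothetical helper: the grid on which a KDE curve would be evaluated."""
    lower = max(data.min() - bw * cut, clip[0])
    upper = min(data.max() + bw * cut, clip[1])
    return np.linspace(lower, upper, gridsize)

data = np.array([-1.5, -0.2, 0.1, 0.8, 2.0])
bw = 0.5  # assumed bandwidth, chosen only to keep the arithmetic simple

# A larger cut lets the curve run further past the extreme observations
for cut in (4, 3, 2):
    grid = kde_support(data, bw, cut=cut)
    print("cut", cut, grid[0], grid[-1])

# clip truncates the grid no matter how far cut would extend it
for clip in (1, 2, 3):
    grid = kde_support(data, bw, clip=(-clip, clip))
    print("clip", clip, grid[0], grid[-1])
```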
# Use kdeplot to judge whether two samples come from the same population
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
c1, c2, c3 = sns.color_palette('Set1', 3)
dist1, dist2, dist3 = stats.norm(0, 1).rvs((3, 100))
dist3 = pd.Series(dist3 + 2, name='dist3')
# dist1 and dist2 are two approximately normal samples with the same center and spread
sns.kdeplot(dist1, shade=True, color=c1, ax=ax1)
sns.kdeplot(dist2, shade=True, color=c2, label='dist2', ax=ax1)
# dist3 is another approximately normal sample, but centered at 2.
sns.kdeplot(dist1, shade=True, color=c2, ax=ax2)
sns.kdeplot(dist3, shade=True, color=c3, ax=ax2)
# kdeplot parameter: cumulative
# data must hold one sample per row, so it is redefined here before the loop
data = stats.norm(0, 1).rvs((3, 100)) + np.arange(3)[:, None]
with sns.color_palette("Set1"):
    for d, label in zip(data, list("ABC")):
        sns.kdeplot(d, cumulative=True, label=label)
# vertical parameter: rotates the density plot by 90 degrees
plt.figure(figsize=(4, 8))
with sns.color_palette("Set2"):
    for d, label in zip(data, list("ABC")):
        sns.kdeplot(d, vertical=True, shade=True, label=label)
# plt.hist(data, vertical=True)
# error: not every function has a vertical parameter (plt.hist uses orientation="horizontal" instead)
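For the cumulative option shown above, the curve is the running integral of the density. With Gaussian kernels that integral has a closed form: the average of the normal CDFs centered on each observation. A rough sketch, assuming a seeded sample and a Scott-style bandwidth (both are my choices, not taken from seaborn):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
sample = rng.standard_normal(100)
# Assumption: Scott-style bandwidth, n ** (-1/5) times the sample standard deviation
bandwidth = len(sample) ** (-1.0 / 5) * sample.std(ddof=1)

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

grid = np.linspace(sample.min() - 3 * bandwidth, sample.max() + 3 * bandwidth, 200)
# The cumulative estimate at x is the average of the kernel CDFs
# centered on each observation
cumulative = np.array([
    np.mean([normal_cdf((x - d) / bandwidth) for d in sample]) for x in grid
])
print(round(float(cumulative[0]), 4), round(float(cumulative[-1]), 4))
```

The result rises monotonically from about 0 to about 1, which is the S-shaped curve that cumulative=True draws.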
kdeplot for multidimensional data
data = np.random.multivariate_normal([0, 0], [[1, 2], [2, 20]], size=1000)
data = pd.DataFrame(data, columns=["X", "Y"])
mpl.rc("figure", figsize=(6, 6))
sns.kdeplot(data)
# More often it is used to draw a density plot of two-dimensional data
sns.kdeplot(data.X, data.Y, shade=True, bw="silverman", gridsize=50, clip=(-11, 11))
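# gridsize sets the number of points in the evaluation grid
# cut and clip work like the parameters of the same name described earlier
# cmap selects the colormap that maps density values to colors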
<matplotlib.axes._subplots.AxesSubplot at 0x2768f240> 
sns.kdeplot(data.X, data.Y, shade=True, bw="silverman", gridsize=50, clip=(-11, 11),  cmap="BuGn_d")
sns.kdeplot(data.X, data.Y, shade=True, bw="silverman", gridsize=50, clip=(-11, 11),  cmap="Purples")
OK. Now let's come back to jointplot.
Earlier, jointplot was used with kind=hex. Having seen kernel density plots, we can now combine the two.
with sns.axes_style('white'):
    sns.jointplot(x='X', y='Y', data=data, kind='kde')
An enhanced hist: distplot
# In its simplest form, distplot is a hist plus a density curve
sns.set_palette("hls")
mpl.rc("figure", figsize=(9, 5))
data = randn(200)
sns.distplot(data)
# It soon becomes clear that distplot can do far more than hist.
sns.distplot(data, kde=True, rug=True, hist=True)
# For finer control, pass the *_kws arguments (a dict of parameters for each component)
sns.distplot(data, kde_kws={"color": "seagreen", "lw":3, "label" : "KDE" }, 
             hist_kws={"histtype": "stepfilled", "color": "slategray" })
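OK. The plots below will look familiar: boxplot and violinplot.
A boxplot is another way to describe the distribution of continuous data. It uses the five-number summary as rough statistics of central tendency and dispersion.
A violinplot is similar: it takes the boxplot and adds a density curve (the curves on the two sides of the "violin").
  A violin plot is a method of plotting numeric data. It is a box plot with a rotated kernel density plot on each side.[1]
More info on Wikipedia.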
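The five-number summary behind a boxplot is the minimum, first quartile, median, third quartile and maximum. Here is a small worked example on made-up values, assuming the common 1.5 × IQR rule for the whiskers (matplotlib's default whis=1.5):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Assumption: whiskers follow the 1.5 * IQR rule
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
inside = values[(values >= low_fence) & (values <= high_fence)]
outliers = values[(values < low_fence) | (values > high_fence)]

print("five-number summary:", values.min(), q1, median, q3, values.max())
print("whiskers end at:", inside.min(), inside.max())
print("drawn as fliers:", outliers)
```

Here the quartiles are 3.25, 5.5 and 7.75, the upper fence is 14.5, so the whisker stops at 9 and the value 30 is drawn as a separate flier point.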
# First, the boxplot
sns.set(rc={"figure.figsize": (6, 6)})
data = [randn(100), randn(120) + 1.5]
plt.boxplot(data)
# This is a bare-bones "dataframe": two series (arrays) of unequal length with no index or column names, so the plot labels them 1, 2, 3 by default
{'boxes': [<matplotlib.lines.Line2D at 0x25747908>,
  <matplotlib.lines.Line2D at 0x26995048>],
 'caps': [<matplotlib.lines.Line2D at 0x2574c6d8>,
  <matplotlib.lines.Line2D at 0x2574cc50>,
  <matplotlib.lines.Line2D at 0x26995d68>,
  <matplotlib.lines.Line2D at 0x2699f320>],
 'fliers': [<matplotlib.lines.Line2D at 0x2576e780>,
  <matplotlib.lines.Line2D at 0x2699fe10>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x2576e208>,
  <matplotlib.lines.Line2D at 0x2699f898>],
 'whiskers': [<matplotlib.lines.Line2D at 0x25747b38>,
  <matplotlib.lines.Line2D at 0x2574c160>,
  <matplotlib.lines.Line2D at 0x26995278>,
  <matplotlib.lines.Line2D at 0x269957f0>]} 
# The plot above was drawn by the mpl module and looks rather "ugly"
# Let's see what seaborn makes of it
sns.boxplot(data)
# ... perhaps they are just two different styles!
<matplotlib.axes._subplots.AxesSubplot at 0x2673edd8> 
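# join_rm parameter: rm stands for repeated-measures data
# To highlight the repeated-measures effect, set join_rm=True
# (join_rm and names belong to the older seaborn boxplot API used in this post)
pre = randn(25)
post = pre + np.random.rand(25)
sns.boxplot([pre, post], names=["left", "right"], color="coral", join_rm=True)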
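Why joining the paired points helps: the two boxes overlap heavily, so from the boxes alone the change looks small, yet every single subject moved in the same direction. A numpy-only sketch of the same pre/post construction, with a seeded generator as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
pre = rng.standard_normal(25)
post = pre + rng.random(25)  # every subject shifts by an amount in [0, 1)

diff = post - pre
# The two marginal distributions overlap heavily...
print(round(float(pre.mean()), 3), round(float(post.mean()), 3))
# ...but no paired difference is negative
print(int((diff >= 0).sum()), "of", len(diff), "subjects did not decrease")
```

The lines drawn between the boxes carry exactly this per-subject information, which the boxes themselves discard.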
# Next comes violinplot, again starting from boxplot.
# This is also why I like this module (and its author) so much: it suits my taste
d1 = stats.norm(0, 5).rvs(100)
d2 = np.concatenate([stats.gamma(4).rvs(50), -1 * stats.gamma(4).rvs(50) ])
data = pd.DataFrame(dict(d1=d1, d2=d2))
sns.boxplot(data, color="pastel", widths=.5)
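# boxplot and violinplot are often used to compare a continuous variable across the groups of a (discrete) grouping variable X
# So if you have a DataFrame, try to get used to groupby operations.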
y = np.random.randn(200)
g = np.random.choice(list('abcdef'), 200) 
for i, l in enumerate('abcdef'):
    y[g == l] += i // 2
df = pd.DataFrame(dict(score=y, group=g))
sns.boxplot(df.score, df.group)
<matplotlib.axes._subplots.AxesSubplot at 0x28feec88> 
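What the grouped boxplot summarizes can be checked numerically. The sketch below rebuilds the same kind of data with a seeded generator and a larger sample (both my assumptions, to make the numbers stable) and prints the per-group median, i.e. the center line of each box. With a DataFrame, df.groupby("group")["score"].median() gives the same figures.

```python
import numpy as np

rng = np.random.default_rng(0)
groups = "abcdef"
g = rng.choice(list(groups), 6000)
y = rng.standard_normal(6000)
for i, label in enumerate(groups):
    y[g == label] += i // 2  # shifts: a, b -> 0; c, d -> 1; e, f -> 2

# Per-group median: the center line each box would show
medians = {label: float(np.median(y[g == label])) for label in groups}
for label, med in medians.items():
    print(label, round(med, 2))
```

Groups a and b should sit near 0, c and d near 1, e and f near 2, matching the step pattern in the plot.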
# About names (the list of group names): the default plotting order follows the order in the array; it can also be set with the order parameter
order = list('cbafed')
sns.boxplot(df.score, df.group, order=order, color='PuBuGn_d')
# Using the inner parameter
# inner : {'box' | 'stick' | 'points'}
# Plot quartiles or individual sample values inside violin.
y = np.random.randn(200)
g = np.random.choice(list("abcdef"), 200)
for i, l in enumerate("abcdef"):
    y[g == l] += i // 2
df = pd.DataFrame(dict(score=y, group=g))
sns.boxplot(df.score, df.group)