numpy与pandas各种功能及其对比(超全)_numpy和pandas的区别_阿丢是丢心心的博客-CSDN博客

一、简介

1、numpy的核心数据结构是ndarray

支持任意维数的数组,但要求单个数组内所有数据是同质的,即类型必须相同;pandas是基于numpy数组构建的,它的核心数据结构是series和dataframe,仅支持一维和二维数据,但数据内部可以是异构数据,仅要求同列数据类型一致即可。

numpy的数据结构仅支持数字索引,而pandas数据结构则同时支持数字索引和标签索引。

2、numpy用于数值计算,pandas主要用于数据处理与分析。

numpy虽然也支持字符串等其他数据类型,但仍然主要是用于数值计算,尤其是内部集成了大量矩阵计算模块,例如基本的矩阵运算、线性代数、fft、生成随机数等,支持灵活的广播机制。

pandas主要用于数据处理与分析,支持包括数据读写、数值计算、数据处理、数据分析和数据可视化全套流程操作。

二、基础属性

1.数据结构
在开始创建数据之前,再说明一下两种方法创建的数据结构形式

numpy:是通用的同构数据多维容器,其中的所有元素必须是相同类型的

 pandas:一维数据结构为series,多维是dataframe

 其中:

Series是一种类似于一维数组的对象,它由一组数据(各种NumPy数据类型)以及一组与之相关的数据标签(即索引)组成,索引在左边,值在右边。

 DataFrame是一个表格型的数据结构,既有行索引(index)也有列索引(columns),它可以被看做由Series组成的字典(共用同一个索引)。

每列可以是不同的值类型(数值、字符串、布尔值等)。

 2、对象常用属性


(1)ndarray

data. shape 数组的形状
data.ndim 维度的数量
data.size  数组的总大小
data.dtype 数组中的数据类型(int8, uint8, float32, complex64, bool, object, string.
data.astype 数据类型转换


(2)Series

obj.values 获取数组的值
obj.index 获取数组的索引对象
obj.name 获取数组的名称
obj.index.name 获取索引的名称


(3) DateFrame

frame.index 获取表格的行索引
frame.columns 获取表格的列索引
frame.values 获取表格中的数据
frame.col_name 获取指定列的数据
frame.index.name 获取索引的名称
frame.columns.name 获取列的名称

重新命名列的三种方法

第一种方法:重新命名指定的列

df.rename(columns = {'环湖医院':'开滦医院', '普通医院':'三甲医院'}, inplace = True) 

第二种方法:修改全部列名

df.columns = ['舒畅', '小舒畅', '舒小畅', '舒畅小'] 

第三种方法:修改列名的一部分

df.columns = df.columns.str.replace('环湖医院', '开滦医院')

(由于输出比较长,部分地方就只放了代码,没有放运行结果哈)

三、 一维数据(numpy和pandas.Series对比)

1、新增Create 

#列表转化
list1 = [4, 7, -5, 3]
arr1 = np.array(list1)
s1 = pd.Series(list1)

arr1
s1 

输出:
array([ 4,  7, -5,  3])

0    4
1    7
2   -5
3    3
dtype: int64

 
## 直接创建
arr2 = np.array([4, 7, -5, 3]) 
obj = Series([4, 7, -5, 3], dtype=np.int32)
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

arr2 
obj 
obj2

输出:
array([ 4,  7, -5,  3])

0    4
1    7
2   -5
3    3
dtype: int32

d    4
b    7
a   -5
c    3
dtype: int64
 #字典转换
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

输出:

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64



states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

输出:

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

#创建连续数值数组
arr3 = np.arange(5)
s3 = pd.Series(np.arange(5))
s3_1 = pd.Series(range(5))  #输出与s3相同
 
#创建全为0的数组
arr4 = np.zeros(5)
s4 = pd.Series(0 ,index = ['a','b','c','d'])
s4_1 = pd.Series(np.zeros(5))  
 
#创建随机数数组
arr5 = np.random.rand(5)
arr_1 = np.random.random_sample(size=5)
s5 = pd.Series(np.random.rand(5))

2、查找Read

obj2

输出:
d    4
b    7
a   -5
c    3
dtype: int64



obj2[['c', 'a', 'd']]

输出:
c    3
a   -5
d    4
dtype: int64


obj2[1:3]    # [1, 3)

输出:
b    7
a   -5
dtype: int64


obj2['b':'c']    # []

输出:
b    7
a   -5
c    3
dtype: int64



obj

输出:
0    4
1    7
2   -5
3    3
dtype: int32


obj[1:3]

输出:
1    7
2   -5
dtype: int32


obj.index

输出:
RangeIndex(start=0, stop=4, step=1)


obj2.index

输出:
Index(['d', 'b', 'a', 'c'], dtype='object')



obj2.values

输出:
array([ 4,  7, -5,  3])


obj[1], obj2['a']

输出:
(7, -5)

3、修改Update

obj2

输出为:
d    4
b    7
a   -5
c    3
dtype: int64



obj2['d'] = 6
obj2

输出为:
d    6
b    7
a   -5
c    3
dtype: int64


obj2[['a', 'b', 'c']] = [1, 2, 3]
obj2

输出为:
d    6
b    2
a    1
c    3
dtype: int64




obj4

输出为:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64



obj4.name = 'population'
obj4.index.name = 'state'
obj4

输出为:
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64




obj

输出为:
0    4
1    7
2   -5
3    3
dtype: int32


obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

输出为:
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int32



4、删除 Delete

obj2

输出为:
d    4
b    7
a   -5
c    3
dtype: int64


del obj2['a']

输出为:
此次无输出


obj2.drop('c')    # drop操作返回新的对象,原对象不变

输出为:
d    4
b    7
dtype: int64


obj2

输出为:
d    4
b    7
c    3
dtype: int64


obj2.drop(['b', 'd'])

输出为:
c    3
dtype: int64


obj2

输出为:
d    4
b    7
c    3
dtype: int64

5、运算

5.1、标量运算

数组与标量的算术运算会将标量值传播到各个元素

不同大小的数组之间的算术运算的执行方式叫做广播(broadcasting)

广播中较小数组的‘广播维’必须为1, 广播会在缺失和(或)长度为1的维度上进行。较小的数组会在较大的数组上沿着该维度广播,并在经过的地方做相应的算术运算。最简单的广播就是标量值跟数组合并的运算。

obj2

输出为:
d    4
b    7
a   -5
c    3
dtype: int64



obj2 * 2

输出为:
d     8
b    14
a   -10
c     6
dtype: int64



obj2 > 4

输出为:
d    False
b     True
a    False
c    False
dtype: bool



obj2 / 2

输出为:
d    2.0
b    3.5
a   -2.5
c    1.5
dtype: float64



obj2[obj2 > 4]

输出为:
b    7
dtype: int64



'b' in obj2

输出为:
True


'e' in obj2

输出为:
False

5.2、向量运算

# 根据运算的索引标签自动对齐数据,对不同索引的对象进行算术运算。
obj3

输出为:
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64



obj4

输出为:
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64



obj3 + obj4

输出为:
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64



obj3 * obj4

输出为:
California             NaN
Ohio          1.225000e+09
Oregon        2.560000e+08
Texas         5.041000e+09
Utah                   NaN
dtype: float64



obj3 > obj4  # 会报错,无法匹配



ser1 = pd.Series(np.arange(1, 6), index = list('abcde'))
ser2 = pd.Series(np.arange(2, 7), index = list('abcde'))
ser1 > ser2

输出为:
a    False
b    False
c    False
d    False
e    False
dtype: bool



5.3、函数运算

(1)通用函数


通用函数(ufunc): 一种对ndarray中的数据执行元素级运算的函数。对pandas中的Series和DataFrame也适用。

一元ufunc: abs, sqrt, square, exp, log, sign, isnan, ceil, floor, rint, modf, isfinit, isinf, cos, cosh, sin, sinh, tan, tanh,....

用法:np.abs(arr)

二元ufunc:add, substract, multiply, divide, floor_divide, power, maximum, minimum, mod, copysign, greater, greater_equal, less, less_equal, equal, not_equal, logical_and, logical_or

用法:np.add(arr1, arr2) 

(2)numpy


ndarray对象的基本数学和统计方法函数

sum, mean, std, var, min, max, argmin, argmax, cumsum, cumprod
 

arr = np.random.randn(3,4)
print(arr)
print(arr.mean())
print(arr.sum())
print(arr.mean(axis = 1))
print(arr.sum(axis = 0))


输出为:
[[ 0.32032819  1.26761265 -2.39961199  0.97646207]
 [-0.23177368 -0.31558693 -0.82149254 -1.63021035]
 [-0.84288386 -0.01973655 -0.96049732  1.73198541]]
-0.2437837416655606
-2.9254048999867273
[ 0.04119773 -0.74976587 -0.02278308]
[-0.75432935  0.93228917 -4.18160185  1.07823713]
(3)pandas 


pandas对象拥有的数学和统计方法:count, describe, min, max, argmin, argmax, idxmin, idxmax, quantile, sum, mean, median, mad, var, std, skew, kurt, cumsum, cummin, cummax, cumprod, diff, pct_change
 

obj2

输出为:
d    4
b    7
a   -5
c    3
dtype: int64



np.exp(obj2)

输出为:
d    403.428793
b      7.389056
c     20.085537
dtype: float64



obj4

输出为:
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64



pd.isnull(obj4)

输出为:
state
California     True
Ohio          False
Oregon        False
Texas         False
Name: population, dtype: bool



pd.notnull(obj4)

输出为:
state
California    False
Ohio           True
Oregon         True
Texas          True
Name: population, dtype: bool



obj4.isnull()

输出为:
state
California     True
Ohio          False
Oregon        False
Texas         False
Name: population, dtype: bool



obj4.notnull()

输出为:
state
California    False
Ohio           True
Oregon         True
Texas          True
Name: population, dtype: bool


#顺便对比一下numpy
arr = np.array([[1, 2, 3, 4, 5], [6, 4, 3, 2, 1]])
print(arr)
print(arr.argmax(axis=0))

输出为:
[[1 2 3 4 5]
 [6 4 3 2 1]]
[1 1 0 0 0]






ser = pd.Series([1,3,2,5,7])
print(ser)
print(ser.argmax)

输出为:
0    1
1    3
2    2
3    5
4    7
dtype: int64
<bound method IndexOpsMixin.argmax of 0    1
1    3
2    2
3    5
4    7
dtype: int64>

四、多维数据

1、新增Create

numpy

arr1 =np.array([[1, 2, 3, 4], 
            [5, 6, 7, 8]])

输出为:
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])



s0 = pd.DataFrame([[1, 2, 3, 4], 
            [5, 6, 7, 8]])

输出为:
	0	1	2	3
0	1	2	3	4
1	5	6	7	8


s1 = pd.DataFrame(np.arange(16).reshape((4, 4)))

输出为:

    0	1	2	3
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15

我们经常用字典的方式来创建series和dataframe

#外层字典的键作为列索引,dataframe自动加上行索引
data = {'one': [0, 1],
        'two': [2, 3]}
s3 = pd.DataFrame(data)


输出为:
   one	two
0	0	2
1	1	3

 
#对于嵌套字典,外层字典作为列索引,内层键作为行索引
data = {'one': {'a':0, 'b':1},
        'two': {'a':2, 'b':3}}
s4 = pd.DataFrame(data)

输出为:

   one	two
a	0	2
b	1	3

还可以自定义列和索引,并且依据索引对改变数据顺序

#自定义列和索引标签
s4 = pd.DataFrame(data, columns=['one', 'two', 'three', 'four'])

输出为:

    one	two	three	four
a	0	2	NaN	    NaN
b	1	3	NaN  	NaN


s5 = pd.DataFrame(data, columns=['three', 'two', 'one', 'four'])
输出为:


    three	two	one	four
a	NaN	    2	0	NaN
b	NaN	    3	1	NaN

综合上述重举例

data = {
            'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
       }
frame = DataFrame(data)
frame

输出为:

    state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9



frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five'])
frame2

输出为:

        year	state	pop	  debt
one	    2000	Ohio	1.5	  NaN
two	    2001	Ohio	1.7	  NaN
three	2002	Ohio	3.6	  NaN
four	2001	Nevada	2.4   NaN
five	2002	Nevada	2.9	  NaN


obj = Series([4, 7, -5, 3, 10], index=['d', 'b', 'a', 'c', 'e'])
obj2 = Series([8, 1, 2, -21], index=['d', 'b', 'a', 'c'])
DataFrame({'col1': obj, 'col2': obj2})

输出为:

    col1	col2
a	-5  	2.0
b	7   	1.0
c	3   	-21.0
d	4   	8.0
e	10   	NaN



frame3 = DataFrame({'year': {'one': 2000, 'two': 2001, 'three': 2002, 'four': 2001, 'five': 2002}, 
                    'state': {'one': 'Ohio', 'two': 'Ohio', 'three': 'Ohio', 'four': 'Nevada', 'five': 'Nevada'},
                    'pop': {'one': 1.5, 'two': 1.7, 'three': 3.6, 'four': 2.4, 'five': 2.9}})

输出为:


        year	state	pop
one 	2000	Ohio	1.5
two 	2001	Ohio	1.7
three	2002	Ohio	3.6
four	2001	Nevada	2.4
five	2002	Nevada	2.9

2、查找Read

frame2

输出为:
        year	state	pop	debt
one	    2000	Ohio	1.5	NaN
two	    2001	Ohio	1.7	NaN
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	NaN


frame2.values

输出为:
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan]], dtype=object)


frame2.shape

输出为:
(5, 4)



frame2.index

输出为:
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')



frame2.columns

输出为:
Index(['year', 'state', 'pop', 'debt'], dtype='object')

frame2['state']    # 获取一列,得到一个Series

输出为:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object




frame2.state    # 获取一列,得到一个Series

输出为:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object



frame2[['state', 'debt']]    # 获取多列,得到一个DataFrame

输出为:
        state	debt
one	    Ohio	NaN
two	    Ohio	NaN
three	Ohio	NaN
four	Nevada	NaN
five	Nevada	NaN



frame2['one':'one']    # 获取行,必须使用切片的方式 []

输出为:
    year	state	pop	debt
one	2000	Ohio	1.5	NaN


frame2[0:4]    # 前4行,容易混淆   [)

输出为:
        year	state	pop	debt
one	    2000	Ohio	1.5	NaN
two	    2001	Ohio	1.7	NaN
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	NaN




frame2.reindex(['two', 'one'])    # 选出特定的行,按照参数的顺序

输出为:
    year	state	pop	debt
two	2001	Ohio	1.7	NaN
one	2000	Ohio	1.5	NaN



frame2.loc['one':'three', ['state', 'debt']]    # loc使用index名字,第一个参数索引行,第二个参数索引列

输出为:
	    state	debt
one	    Ohio	NaN
two	    Ohio	NaN
three	Ohio	NaN




frame2.loc[:, ['state', 'debt']]    # 所有行

输出为:
	    state	debt
one	    Ohio	NaN
two	    Ohio	NaN
three	Ohio	NaN
four	Nevada	NaN
five	Nevada	NaN




frame2.loc['one':'one', ['state', 'debt']]    # 一行,得到DataFrame

输出为:
	state	debt
one	Ohio	NaN




frame2.loc['one', ['state', 'debt']]    # 一行,得到Series

输出为:
state    Ohio
debt      NaN
Name: one, dtype: object



frame2

输出为:
	    year	state	pop	debt
one	    2000	Ohio	1.5	NaN
two	    2001	Ohio	1.7	NaN
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	NaN




frame2.iloc[0:3, [1, 3]]    # iloc使用绝对位置(从0开始),第一个参数索引行,第二个参数索引列, 推荐使用iloc

输出为:
	    state	debt
one	    Ohio	NaN
two	    Ohio	NaN
three	Ohio	NaN




frame2.iloc[:, [1,3]]    # 所有行

输出为:
	    state	debt
one	    Ohio	NaN
two	    Ohio	NaN
three	Ohio	NaN
four	Nevada	NaN
five	Nevada	NaN




frame2.iloc[0:1, [1, 3]]    # 一行,得到DataFrame

输出为:
	state	debt
one	Ohio	NaN



frame2.iloc[0, [1, 3]]    # 一行,得到Series

输出为:
state    Ohio
debt      NaN
Name: one, dtype: object




3、修改Update

frame2

输出为:
	    year	state	pop	debt
one	    2000	Ohio	1.5	NaN
two	    2001	Ohio	1.7	NaN
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	NaN



frame2['debt'] = 16.5
frame2

输出为:
	    year	state	pop	debt
one	    2000	Ohio	1.5	16.5
two	    2001	Ohio	1.7	16.5
three	2002	Ohio	3.6	16.5
four	2001	Nevada	2.4	16.5
five	2002	Nevada	2.9	16.5



frame2['debt'] = np.arange(5)
frame2

输出为:
        year	state	pop	debt
one	    2000	Ohio	1.5	0
two	    2001	Ohio	1.7	1
three	2002	Ohio	3.6	2
four	2001	Nevada	2.4	3
five	2002	Nevada	2.9	4



val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

输出为:
	    year	state	pop	debt
one	    2000	Ohio	1.5	NaN
two	    2001	Ohio	1.7	-1.2
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	-1.5
five	2002	Nevada	2.9	-1.7




frame2.name = 'data frame'
frame2.columns.name = 'columns'
frame2.index.name = 'index'
frame2

输出为:
columns	year	state	pop	debt
index				
one  	2000	Ohio	1.5	NaN
two 	2001	Ohio	1.7	-1.2
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	-1.5
five	2002	Nevada	2.9	-1.7




frame2.columns = ['a', 'b', 'c', 'd']
frame2

输出为:
	    a	    b	    c	d
index				
one 	2000	Ohio	1.5	NaN
two 	2001	Ohio	1.7	-1.2
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	-1.5
five	2002	Nevada	2.9	-1.7



frame2.index = [1, 2, 3, 4, 5]
frame2

输出为:
    a	    b	    c	d
1	2000	Ohio	1.5	NaN
2	2001	Ohio	1.7	-1.2
3	2002	Ohio	3.6	NaN
4	2001	Nevada	2.4	-1.5
5	2002	Nevada	2.9	-1.7

4、删除Delete

frame2

输出为:
    a	    b	    c	d
1	2000	Ohio	1.5	NaN
2	2001	Ohio	1.7	-1.2
3	2002	Ohio	3.6	NaN
4	2001	Nevada	2.4	-1.5
5	2002	Nevada	2.9	-1.7



frame2.drop([1, 2], inplace=True, axis=0)    # 删除行,inplace参数控制是否真正删除
frame2

输出为:
	a	    b	    c	d
3	2002	Ohio	3.6	NaN
4	2001	Nevada	2.4	-1.5
5	2002	Nevada	2.9	-1.7



frame2.drop(['a', 'b'], axis=1)    # 删除列

输出为:
c	d
3	3.6	NaN
4	2.4	-1.5
5	2.9	-1.7


frame2.pop('c')    # 真正删除
frame2

输出为:
    a	    b	    d
3	2002	Ohio	NaN
4	2001	Nevada	-1.5
5	2002	Nevada	-1.7




del frame2['a']
frame2

输出为:
    b	    d
3	Ohio	NaN
4	Nevada	-1.5
5	Nevada	-1.7

5、运算

5.1、标量运算

frame3

输出为:
	    year	state	pop
five	2002	Nevada	2.9
four	2001	Nevada	2.4
one	    2000	Ohio	1.5
three	2002	Ohio	3.6
two	    2001	Ohio	1.7



frame3['pop'] > 2

输出为:
five      True
four      True
one      False
three     True
two      False
Name: pop, dtype: bool




frame3[frame3['pop'] > 2]

输出为:
year	state	pop
five	2002	Nevada	2.9
four	2001	Nevada	2.4
three	2002	Ohio	3.6


frame

输出为:
	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9



frame['pop']

输出为:
0    1.5
1    1.7
2    3.6
3    2.4
4    2.9
Name: pop, dtype: float64



# Dataframe
df = pd.DataFrame(np.arange(9.).reshape((3,3)),
                 index = list('abc'),
                 columns = ['one','two','three'])

print(df)

输出为:
 one  two  three
a  0.0  1.0    2.0
b  3.0  4.0    5.0
c  6.0  7.0    8.0


print(df + 1)

输出为:
   one  two  three
a  1.0  2.0    3.0
b  4.0  5.0    6.0
c  7.0  8.0    9.0


print(np.add(df, 1))  #dataframe基于numpy数组进行计算,我们可以直接使用np的运算方法

输出为:
   one  two  three
a  1.0  2.0    3.0
b  4.0  5.0    6.0
c  7.0  8.0    9.0

print(df ** 2)

输出为:
    one   two  three
a   0.0   1.0    4.0
b   9.0  16.0   25.0
c  36.0  49.0   64.0

print((df > 2) & (df < 6))

输出为:
     one    two  three
a  False  False  False
b   True   True   True
c  False  False  False

5.2、向量运算

df = pd.DataFrame(np.arange(9.).reshape((3,3)),
                 index = list('abc'),
                 columns = ['one','two','three'])
print(df)

输出为:
   one  two  three
a  0.0  1.0    2.0
b  3.0  4.0    5.0
c  6.0  7.0    8.0


#当从DataFrame减去相同索引的Series(df.iloc[0]),每一行都会执行这个操作。这就叫做广播(broadcasting)
ser = df.iloc[0]
print(ser)
print(df-ser)

输出为:
one      0.0
two      1.0
three    2.0
Name: a, dtype: float64

   one  two  three
a  0.0  0.0    0.0
b  3.0  3.0    3.0
c  6.0  6.0    6.0


 
#如果希望匹配行且在列上广播,则必须使用DataFrame的算术运算方法。
#传入的轴号就是希望匹配的轴,行索引(axis=‘index’ or axis=0)或列索引(axis=‘column’ or axis=1)。
ser2 = df.iloc[:,1]
print(ser2)
df.sub(ser2, axis = 'index')
 
输出为:
a    1.0
b    4.0
c    7.0
Name: two, dtype: float64

	one	    two	three
a	-1.0	0.0	1.0
b	-1.0	0.0	1.0
c	-1.0	0.0	1.0


 
#如果某个索引值在DataFrame的列或Series的索引中找不到,则参与运算的两个对象就会被重新索引以形成并集
ser3 = pd.Series(range(3), index = ['one', 'two','four'])
print(ser3)
print(df - ser3)
 
输出为:
one     0
two     1
four    2
dtype: int64

   four  one  three  two
a   NaN  0.0    NaN  0.0
b   NaN  3.0    NaN  3.0
c   NaN  6.0    NaN  6.0

5.3、函数运算

frame['pop'].apply(lambda x: x**3)    # 对df的某一列应用一个函数,生成新的Series

输出为:
0     3.375
1     4.913
2    46.656
3    13.824
4    24.389
Name: pop, dtype: float64

六、文件读写I/O

df = pd.read_csv('E:\\data\\ad_feature.csv', sep=',', header=0)    # 读入数据
df.shape

输出为:
(846811, 6)


df.columns

输出为:
Index(['adgroup_id', 'cate_id', 'campaign_id', 'customer', 'brand', 'price'], dtype='object')


df.head(10)

输出为:
    adgroup_id	cate_id	campaign_id	customer	brand	    price
0	63133	    6406	83237	    1	        95471.0	    170.00
1	313401	    6406	83237	    1	        87331.0	    199.00
2	248909	    392	    83237	    1	        32233.0	    38.00
3	208458	    392	    83237	    1	        174374.0	139.00
4	110847	    7211	135256	    2	        145952.0	32.99
5	607788	    6261	387991	    6	        207800.0	199.00
6	375706	    4520	387991	    6	        NaN	        99.00
7	11115	    7213	139747	    9	        186847.0	33.00
8	24484	    7207	139744	    9	        186847.0	19.00
9	28589	    5953	395195	    13	        NaN	        428.00


df[:100].to_csv('E:\\data\\output.csv', index=False)    # 输出成CSV格式,索引(index)不要输出

df[:100].to_html('E:\\data\\output.html', index=False)    # 输出成网页table

writer = pd.ExcelWriter('E:\\data\\output.xlsx')    # 输出到Excel文件
df[:100].to_excel(writer, 'Sheet1', index=False)
df[100:200].to_excel(writer, 'Sheet2', index=False)
writer.save()

1、基本操作 Essential Basic Functionality

df.head(10)    # 前十行

adgroup_id cate_id campaign_id customer brand price
0 63133 6406 83237 1 95471.0 170.00
1 313401 6406 83237 1 87331.0 199.00
2 248909 392 83237 1 32233.0 38.00
3 208458 392 83237 1 174374.0 139.00
4 110847 7211 135256 2 145952.0 32.99
5 607788 6261 387991 6 207800.0 199.00
6 375706 4520 387991 6 NaN 99.00
7 11115 7213 139747 9 186847.0 33.00
8 24484 7207 139744 9 186847.0 19.00
9 28589 5953 395195 13 NaN 428.00

df.tail(10)    # 倒数十行

adgroup_id cate_id campaign_id customer brand price
846801 841089 5271 378762 255848 NaN 75.0
846802 828175 8116 374639 255851 430861.0 376.0
846803 831181 7600 377541 255862 122355.0 260.0
846804 762452 4284 379172 255865 NaN 128.0
846805 762200 4284 379172 255865 NaN 169.0
846806 824255 4526 380022 255868 389713.0 268.0
846807 790170 4280 379736 255872 322171.0 68.0
846808 845286 6261 379736 255872 322171.0 88.0
846809 824732 4520 379736 255872 322171.0 68.0
846810 845337 11156 379603 255874 74120.0 279.0

df['price'].describe()    # 一列的统计信息,只对于可以比较大小的值有意义

count    8.468110e+05
mean     1.838867e+03
std      3.108877e+05
min      1.000000e-02
25%      4.900000e+01
50%      1.390000e+02
75%      3.520000e+02
max      1.000000e+08
Name: price, dtype: float64


df['cate_id'].describe()    # 类别,无意义

count    846811.000000
mean       5868.593464
std        2705.171203
min           1.000000
25%        4383.000000
50%        6183.000000
75%        7047.000000
max       12960.000000
Name: cate_id, dtype: float64



# 迭代 Iteration, 一般用不到, 用于精细调整数据
for row in df[:10].itertuples():
    print(row.cate_id, row.price)

6406 170.0
6406 199.0
392 38.0
392 139.0
7211 32.99
6261 199.0
4520 99.0
7213 33.0
7207 19.0
5953 428.0


# 排序 Sort

df[:10].sort_values(by='price')    # 按照价格从高到低排序

adgroup_id cate_id campaign_id customer brand price
8 24484 7207 139744 9 186847.0 19.00
4 110847 7211 135256 2 145952.0 32.99
7 11115 7213 139747 9 186847.0 33.00
2 248909 392 83237 1 32233.0 38.00
6 375706 4520 387991 6 NaN 99.00
3 208458 392 83237 1 174374.0 139.00
0 63133 6406 83237 1 95471.0 170.00
1 313401 6406 83237 1 87331.0 199.00
5 607788 6261 387991 6 207800.0 199.00
9 28589 5953 395195 13 NaN 428.00

df[:10].sort_values(by=['price', 'brand'], ascending=False)     # 可以按照多个key进行排序

adgroup_id cate_id campaign_id customer brand price
9 28589 5953 395195 13 NaN 428.00
5 607788 6261 387991 6 207800.0 199.00
1 313401 6406 83237 1 87331.0 199.00
0 63133 6406 83237 1 95471.0 170.00
3 208458 392 83237 1 174374.0 139.00
6 375706 4520 387991 6 NaN 99.00
2 248909 392 83237 1 32233.0 38.00
7 11115 7213 139747 9 186847.0 33.00
4 110847 7211 135256 2 145952.0 32.99
8 24484 7207 139744 9 186847.0 19.00

df[(df['price'] > 1000) & (df['price'] <= 1005)]    # 按条件选取

adgroup_id cate_id campaign_id customer brand price
15532 162225 438 195767 10304 27548.0 1003.20
15629 198226 1137 132021 10304 27548.0 1003.95
63708 493633 6529 24297 43846 65862.0 1005.00
95975 70894 6143 415904 70491 57921.0 1001.00
141950 146094 6422 202221 110338 383342.0 1001.00
170936 56930 438 63192 135753 236763.0 1001.00
249594 131736 7548 239667 218706 NaN 1004.00
269105 731301 6300 321125 241392 449625.0 1001.00
381495 95726 438 15319 69075 198458.0 1004.00
387720 520899 10061 196687 74656 121466.0 1004.00
387729 217219 10061 196687 74656 121466.0 1004.00
426948 740358 685 37953 111752 NaN 1001.00
446511 356429 6261 244039 129461 291007.0 1001.00
466898 502035 6261 76193 146324 348841.0 1003.00
644946 778184 4282 64922 55566 NaN 1001.00
655803 279547 4290 409222 64015 109551.0 1002.00
808370 202848 1535 211527 208644 33413.0 1001.00

2、数据分组 Groupby

df = pd.read_csv('E:\\data\\ad_feature.csv', sep=',', header=0, nrows=1000)
cate_group = df.groupby(['cate_id', 'customer'])    # 按照cate_id,customer进行分组,把cate_id,customer相同的数据分成一组
type(cate_group)

pandas.core.groupby.generic.DataFrameGroupBy



cate_group.size()

cate_id  customer
12       63          3
         369         1
45       340         3
122      225         3
         658         2
                    ..
11932    196         1
12115    516         1
12159    112         1
12390    130         3
12845    323         1
Length: 472, dtype: int64



cate_group['price', 'brand'].mean().head()   # 计算每一组price和brand的平均值,必须保证函数在每个组中产生一个值


	        price	    brand
cate_id		
12	        374.000000	97165.0
45	        62.666667	141605.0
122	        28.800000	386513.0
124	        89.000000	100353.0
133	        2.000000	NaN




df.groupby('cate_id')['brand'].mean().reset_index(drop=False).head()    # reset_index会重新设置index,drop参数控制是否丢弃原来的index



    cate_id	    brand
0	12	        97165.0
1	45	        141605.0
2	122	        386513.0
3	124	        100353.0
4	133	        NaN


3、数据合并(Merge, Concatenate)

df = pd.read_csv('E:\\data\\ad_feature.csv', sep=',', header=0)
df1 = df[['adgroup_id', 'price']]
df2 = df[['adgroup_id', 'cate_id', 'brand']]

df1.head()

输出为:
    adgroup_id	price
0	63133	    170.00
1	313401	    199.00
2	248909	    38.00
3	208458	    139.00
4	110847	    32.99



df2.head()

输出为:
    adgroup_id	cate_id	brand
0	63133	    6406	95471.0
1	313401	    6406	87331.0
2	248909	    392	    32233.0
3	208458	    392	    174374.0
4	110847	    7211	145952.0


df3 = pd.merge(df1, df2, on='adgroup_id', how='inner')     # 把adgroup_id相同的数据拼接
df3.head()

输出为:
    adgroup_id	price	cate_id	brand
0	63133	    170.00	6406	95471.0
1	313401	    199.00	6406	87331.0
2	248909	    38.00	392	    32233.0
3	208458	    139.00	392	    174374.0
4	110847	    32.99	7211	145952.0


pd.concat([df[:4], df[10:15]], axis=0)    # 按照axis轴进行拼接,axis=0在行上拼接,axis=1在列上拼接

输出为:
	adgroup_id	cate_id	campaign_id	customer	brand	    price
0	63133	    6406	83237	    1	        95471.0	    170.0
1	313401	    6406	83237	    1	        87331.0	    199.0
2	248909	    392	    83237	    1	        32233.0	    38.0
3	208458	    392	    83237	    1	        174374.0	139.0
10	23236	    5953	395195	    13	        NaN	        368.0
11	300556	    5953	395195	    13	        NaN	        639.0
12	92560	    5953	395195	    13	        NaN	        368.0
13	590965	    4284	28145	    14	        454237.0	249.0
14	529913	    4284	70206	    14	        NaN	        249.0


pd.concat([df[:4], df[10:15]], axis=1)    # 把index相同的拼接成一行

输出为:

df6 = pd.concat([df[:4], df[10:15].reset_index(drop=True)], axis=1)   去除原index
df6

输出为:

df6.columns

输出为:
Index(['adgroup_id', 'cate_id', 'campaign_id', 'customer', 'brand', 'price',
       'adgroup_id', 'cate_id', 'campaign_id', 'customer', 'brand', 'price'],
      dtype='object')


df6['cate_id']


输出为:
    cate_id	cate_id
0	6406.0	5953
1	6406.0	5953
2	392.0	5953
3	392.0	4284
4	NaN	    4284

4、缺失值(Missing Value)处理

df.shape

输出为:
(846811, 6)


df.dropna().shape    # 丢弃NaN值

输出为:
(600481, 6)
df.fillna(0).head(10)    # 将NaN填充成固定值

输出为:

df.fillna(df['brand'].mean()).head(10)    # 使用brand均值填充

输出为:

df.fillna(method='ffill').head(10)    # 将NaN填充成前一个值

输出为:

5、异常值 / 离群值(Outlier Data)处理

price_des = df['price'].describe()
price_des

输出为:
count    8.468110e+05
mean     1.838933e+03
std      3.109008e+05
min      1.000000e-02
25%      4.900000e+01
50%      1.390000e+02
75%      3.520000e+02
max      1.000000e+08
Name: price, dtype: float64


valid_max = price_des['50%'] + 3 * (price_des['75%'] - price_des['50%'])
valid_min = price_des['50%'] - 3 * (price_des['50%'] - price_des['25%'])
df['price'] = df['price'].clip(valid_min, valid_max)    # 把异常值变为边界值
df['price'].describe()

输出为:
count    846811.000000
mean        243.996387
std         253.800315
min           0.010000
25%          49.000000
50%         139.000000
75%         352.000000
max         778.000000
Name: price, dtype: float64

Logo

有“AI”的1024 = 2048,欢迎大家加入2048 AI社区

更多推荐