File name: pandas常用操作.pdf (common pandas operations)
  Category: Python
  Development tool:
  File size: 685kb
  Downloads: 0
  Uploaded: 2019-08-31
  Provider: just****
Description: Common operations of the pandas library, with the book "Pandas Cookbook" as the reference. Solid, practical material.

movie.get_dtype_counts()  # output the number of columns with each specific data type
movie.select_dtypes(include=['int']).head()  # select only integer columns
movie.filter(like='facebook').head()  # the like parameter means "contains this string"
movie.filter(regex=r'\d').head()  # filter supports regular expressions
movie.filter(items=['actor_1_name', 'asdf'])  # items takes a list of exact column names

Operating on the whole DataFrame
movie.shape
movie.count()
movie.min()  # the minimum of each column
movie.isnull().any().any()  # check whether the DataFrame has any missing value, by chaining two any() calls
(college_ugds_.head() + .00501) // .01  # columns can take part in arithmetic; the operation is applied to every element

# the main way to count missing values is the isnull method
college_ugds_.isnull().sum()
college_ugds_.sort_values('UGDS_HISP', ascending=False)  # sort by one column
college_ugds_.dropna(how='all')  # drop a row only if all of its columns are missing

Data analysis

1. Inspecting the data
college = pd.read_csv('data/college.csv')
college.head()
college.shape
display(college.describe(include=[np.number]).T)  # summarize the numeric columns and transpose; highly recommended!

Selecting subsets of the data
The .iloc indexer selects only by integer location and works similarly to Python lists.
The .loc indexer selects only by index label, which is similar to how Python dictionaries work.
Both work on rows as well as columns:
college.iloc[:, [4, 6]].head()  # all rows of two columns
college.loc[:, ['WOMENONLY', 'SATVRMID']]
college.iloc[[60, 99, 3]].index.tolist()  # .index.tolist() extracts the index labels into a list
college.iloc[5, -4]  # integer indexing
college.loc['The University of Alabama', 'PCTFLOAN']  # label indexing
college[10:20:2]  # slice rows
city = college['CITY']
city[10:20:2]  # a Series can be sliced the same way

Boolean indexing
Boolean indexing, also called boolean selection, selects rows by providing boolean values. These boolean values are usually stored in a Series. Conditions can be combined with and/or/not (&, |, ~), but note that in Python the bitwise operators have higher precedence than the comparison operators, so parentheses are required.
criteria1 = movie.imdb_score > 8
criteria2 = movie.content_rating == 'PG-13'
criteria3 = (movie.title_year < 2000) | (movie.title_year >= 2010)  # the parentheses are required
final = criteria1 & criteria2 & criteria3
movie[final]  # used as an indexer, this keeps the rows where the value is True

employee.BASE_SALARY.between(80000, 120000)  # use between to select a range
# exclude the 5 most common departments
criteria = ~employee.DEPARTMENT.isin(top_5_depts)
employee[criteria].head()
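The boolean selection above can be sketched on a toy frame (the column names mirror the book's movie dataset, but the values here are invented for illustration):

```python
import pandas as pd

# Invented stand-in for the book's movie dataset
movie = pd.DataFrame({
    "title_year": [1995, 2005, 2012, 1999],
    "imdb_score": [8.5, 6.0, 9.1, 7.2],
    "content_rating": ["PG-13", "R", "PG-13", "PG-13"],
})

# Combine conditions with & and |; parentheses are required because the
# bitwise operators bind tighter than the comparison operators.
criteria1 = movie.imdb_score > 8
criteria2 = movie.content_rating == "PG-13"
criteria3 = (movie.title_year < 2000) | (movie.title_year >= 2010)
final = criteria1 & criteria2 & criteria3

selected = movie[final]  # keeps only the rows where final is True
```

Rows 0 (1995, 8.5, PG-13) and 2 (2012, 9.1, PG-13) satisfy all three conditions, so only they survive the selection.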
When the condition is complex, use the DataFrame query method
df.query('A > B')  # equivalent to df[df.A > df.B]
# read the employee data and choose the departments and columns to select
employee = pd.read_csv('data/employee.csv')
depts = ['Houston Police Department-HPD', 'Houston Fire Department (HFD)']
select_columns = ['UNIQUE_ID', 'DEPARTMENT', 'GENDER', 'BASE_SALARY']
qs = "DEPARTMENT in @depts and GENDER == 'Female' and 80000 <= BASE_SALARY <= 120000"
emp_filtered = employee.query(qs)
emp_filtered[select_columns].head()

Masking rows of a DataFrame, so that every row satisfying the condition disappears
criteria = c1 | c2
movie.mask(criteria).head()

Replacing the values that do not satisfy a condition, with pandas where
s = pd.Series(range(5))
s.where(s > 0)
s.where(s > 1, 10)

split-apply-combine: a common data analysis pattern of breaking up data into independent, manageable chunks, independently applying functions to these chunks, and then combining the results back together.

[Figure: split-apply-combine illustrated on account/order/ext price tables -- the input is split into one chunk per account, an aggregation (sum) or a transform is applied to each chunk, and the results are combined back, e.g. into an Order_Total column.]

In pandas, this pattern is groupby
# group by AIRLINE and use the agg method, passing the column to aggregate and the aggregation function
flights.groupby('AIRLINE').agg({'ARR_DELAY': 'mean'}).head()
# or select the column with an indexer and pass the aggregation function to agg as a string
flights.groupby('AIRLINE')['ARR_DELAY'].agg('mean').head()
flights.groupby('AIRLINE')['ARR_DELAY'].mean().head()
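The groupby/agg calls above can be sketched on a toy frame (the column names mirror the book's flights dataset; the delays are invented):

```python
import pandas as pd

# Invented stand-in for the book's flights dataset
flights = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "UA", "UA", "UA"],
    "ARR_DELAY": [10.0, 20.0, 5.0, 15.0, 25.0],
})

# Split by AIRLINE, apply mean to ARR_DELAY, combine one value per group
per_airline = flights.groupby("AIRLINE")["ARR_DELAY"].agg("mean")
```

Here AA averages (10 + 20) / 2 = 15.0 and UA averages (5 + 15 + 25) / 3 = 15.0; the result is a Series indexed by the grouping key.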
# grouping keys, selected columns, and aggregation functions can all be lists; they are matched one to one
flights.groupby(['AIRLINE', 'WEEKDAY'])[['CANCELLED', 'DIVERTED']].agg(['sum', 'mean']).head(7)
# different functions can be applied to the same column
group_cols = ['ORG_AIR', 'DEST_AIR']
agg_dict = {'CANCELLED': ['sum', 'mean', 'size'], 'AIR_TIME': ['mean', 'var']}
flights.groupby(group_cols).agg(agg_dict).head()
# in the next example, max_deviation is a user-defined function
def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()
college.groupby('STABBR')['UGDS'].agg(max_deviation).round(1).head()

grouped = college.groupby(['STABBR', 'RELAFFIL'])
grouped.ngroups  # the ngroups attribute gives the number of groups
list(grouped.groups.keys())

filter() is used to filter the data, transform() produces new data:
if we want to get a single value for each group -> use aggregate
if we want to get a subset of the input rows -> use filter
if we want to get a new value for each input row -> use transform
To apply a complex operation to a column, use the apply() function.

Data tidying
The stack method converts all the column values of each row into row values; the unstack method reverses it.
state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
state_fruit.stack()  # then use reset_index to turn the result into a DataFrame; assigning to .columns renames the columns
# rename_axis can also name the different index levels
state_fruit.stack().rename_axis(['state', 'fruit']).reset_index(name='weight')

To read only specific columns with read_csv, pass the usecols parameter
usecol_func = lambda x: 'UGDS' in x or x == 'INSTNM'
college = pd.read_csv('data/college.csv', usecols=usecol_func)

Pivot tables
pivot_table pivots on the different column names:
1. The index parameter takes a column (or columns) that will not be pivoted and whose unique values will be placed in the index.
2. The columns parameter takes a column (or columns) that will be pivoted and whose unique values will be made into column names.
3. The values parameter takes a column (or columns) that will be aggregated.
4. The aggfunc parameter determines how the columns in the values parameter get aggregated.
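The four pivot_table parameters above can be sketched on a small long-format frame (the state/fruit/weight columns echo the state_fruit example; the numbers are invented):

```python
import pandas as pd

# Invented long-format data: one row per (state, fruit) observation
sales = pd.DataFrame({
    "state":  ["TX", "TX", "CA", "CA"],
    "fruit":  ["apple", "orange", "apple", "orange"],
    "weight": [12, 10, 8, 6],
})

table = sales.pivot_table(index="state",    # not pivoted; unique values become the row index
                          columns="fruit",  # pivoted; unique values become column names
                          values="weight",  # the column that gets aggregated
                          aggfunc="sum")    # how the values column is aggregated
```

The result is a 2x2 table with states as rows and fruits as columns; with one observation per cell, the sum is just the original weight.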
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html

Combining DataFrames
concat gives the flexibility to join based on the axis (all rows or all columns).
append is the specific case (axis=0, join='outer') of concat.
join is based on the indexes (set by set_index); its how parameter takes one of ['left', 'right', 'inner', 'outer'].
merge is based on a particular column of each of the two DataFrames; these columns are specified with left_on, right_on, or on.

sex_age = wl_melt['sex_age'].str.split(expand=True)  # at this point several string methods can be applied
movie.insert(0, 'id', np.arange(len(movie)))  # insert a new column
A new row can be added directly with loc
new_data_list = ['Aria', 1]
names.loc[4] = new_data_list  # equivalent to assigning the list directly, e.g. names.loc[4] = ['Zach', 3]
names.append({'Name': 'Aria', 'Age': 1}, ignore_index=True)  # append can add several rows at once; put them in a list
data_dict = bball_16.iloc[0].to_dict()
# the keys parameter names the two DataFrames; the names parameter renames each index level
pd.concat(s_list, keys=[2016, 2017], names=['Year', 'Symbol'])
pres_41_45['President'].value_counts()

Time
pd.to_datetime is capable of converting entire lists or Series of strings or integers to Timestamps.
# import the datetime module to create date, time and datetime objects
date = datetime.date(year=2013, month=6, day=7)
# date is 2013-06-07
# time is 12:30:19.463198
# datetime is 2013-06-07 12:30:19.463198
s = pd.Series(['12-5-2015', '14-1-2013', '20/12/2017', '40/23/2017'])
pd.to_datetime(s, dayfirst=True, errors='coerce')
pd.Timestamp(year=2012, month=12, day=21, hour=5, minute=10, second=8, microsecond=99)
pd.Timestamp('2016/1/10')
pd.Timestamp('2016-01-05T05:34:43.123456789')
pd.Timestamp(500)  # an integer can be passed: the number of nanoseconds since 1970-01-01 00:00:00
pd.to_datetime('2015-5-13')  # a related function is pd.to_timedelta
The to_timedelta function produces a Timedelta object
pd.Timedelta('12 days 5 hours 3 minutes 123456789 nanoseconds')
time_strings = ['2 days 24 minutes 89.67 seconds', '00:45:23.6']
pd.to_timedelta(time_strings)
# Timedelta objects can be added to and subtracted from Timestamps, and two Timedeltas can even be divided, returning a float
pd.Timedelta('12 days 5 hours 3 minutes') * 2
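The to_datetime and Timedelta behavior above can be verified directly; this sketch reuses the Series from the notes and an invented pair of Timedeltas:

```python
import pandas as pd

# dayfirst=True parses '12-5-2015' as 12 May 2015; errors='coerce' turns
# an impossible date like '40/23/2017' into NaT instead of raising.
s = pd.Series(['12-5-2015', '14-1-2013', '20/12/2017', '40/23/2017'])
parsed = pd.to_datetime(s, dayfirst=True, errors='coerce')

# Timedeltas support arithmetic; dividing one Timedelta by another
# returns a plain float.
td = pd.Timedelta('12 hours')
ratio = (td * 2) / pd.Timedelta('6 hours')
```

Here parsed[0] is Timestamp('2015-05-12'), parsed[3] is NaT, and ratio is 24 h / 6 h = 4.0.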
ts = pd.Timestamp('2016-10-1 4:23:23.9')
ts.ceil('h')  # Timestamp('2016-10-01 05:00:00')
td.total_seconds()

The time column can be set as the index at import time, which speeds things up, and a time index supports partial matching.
# REPORTED_DATE is set as the row index, so the rows can be sliced with partial Timestamps
crime = crime.set_index('REPORTED_DATE').sort_index()
crime.loc['2016-05-12 16:45:00']
# select the data for 2012-06
crime_sort.loc[:'2012-06']
crime.loc['2016-05-12']
# a whole month, a whole year, or one hour of one day can also be selected
crime.loc['2016-05'].shape
crime.loc['2016'].shape
crime.loc['2016-05-12 03'].shape
crime.loc['Dec 2015'].sort_index()
# use the at_time method to select a specific time of day
crime.at_time('5:47').head()
crime.plot(figsize=(16, 4), title='All Denver crimes')
crime_sort.resample('QS-MAR')[['IS_CRIME', 'IS_TRAFFIC']].sum().head()

Concept       Scalar Class  Array Class     pandas Data Type                      Primary Creation Method
Date times    Timestamp     DatetimeIndex   datetime64[ns] or datetime64[ns, tz]  to_datetime or date_range
Time deltas   Timedelta     TimedeltaIndex  timedelta64[ns]                       to_timedelta or timedelta_range
Time spans    Period        PeriodIndex     period[freq]                          Period or period_range
Date offsets  DateOffset    None            None                                  DateOffset

title = 'Denver Crimes and Traffic Accidents per year'
crime['REPORTED_DATE'].dt.year.value_counts().sort_index().plot(kind='barh', title=title)
import seaborn as sns
sns.heatmap(crime_table, cmap='Greys')

Matplotlib offers two interfaces for plotting: procedural (pyplot) and object-oriented; the object-oriented interface, used below, is the recommended one.
x = [-3, 5, 7]
y = [10, 2, 5]
fig, ax = plt.subplots(figsize=(15, 3))
ax.plot(x, y)
ax.set_xlim(0, 10)
ax.set_ylim(-3, 8)
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_title('Line plot')
fig.suptitle('Figure Title', size=20, y=1.03)
med_budget_roll.index.values  # convert the data to numpy, then plot it with plt

Plotting with pandas
df = pd.DataFrame(index=['Atiya', 'Abbas', 'Cornelia', 'Stephanie', 'Monte'],
                  data={'Apples': [20, 10, 40, 20, 50],
                        'Oranges': [35, 40, 25, 19, 33]})
color = ['.2', '.7']
df.plot(kind='bar', color=color, figsize=(16, 4))
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 4))
fig.suptitle('Two Variable Plots', size=20, y=1.02)
df.plot(kind='line', color=color, ax=ax1, title='Line plot')
df.plot(x='Apples', y='Oranges', kind='scatter', color=color, ax=ax2, title='Scatterplot')
df.plot(kind='bar', color=color, ax=ax3, title='Bar plot')
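The partial-string indexing and resample calls earlier in this section can be sketched on synthetic data (the IS_CRIME column name echoes the Denver dataset; the hourly index and values are invented):

```python
import pandas as pd

# One synthetic event per hour over three days, standing in for the crime data
idx = pd.date_range('2016-05-01', periods=72, freq='h')
crime = pd.DataFrame({'IS_CRIME': 1}, index=idx)

# Partial string matching on a DatetimeIndex: '2016-05-02' selects
# all 24 hourly rows of that day.
one_day = crime.loc['2016-05-02']

# resample groups rows by a time frequency; 'D' buckets them per day.
daily = crime['IS_CRIME'].resample('D').sum()
```

With one row per hour, each daily bucket sums to 24, and the 72 hours span exactly three daily buckets.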