R & Python with Anaconda for data analysis


1 R & Python with Anaconda for data analysis
Hongfei Yan 2016/11/30

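2 Origin: big data IT applications in China
According to the "2015-2016 China Big Data Market Research Annual Report" published by CCID, the Internet industry accounts for the largest share (35.5%), followed by telecommunications (19.3%) and finance (18.2%).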

3 R and Python are two of the most popular data science languages
Conda is a leading package and environment manager for data science. It works with both R and Python packages, so you can easily manage and switch between separate environments built with different versions of R, Python, and their associated packages. Anaconda includes conda plus over 330 of the most popular Python packages for science, math, engineering, and data analysis. It also lets you install over 300 R packages and develop on Windows, Mac, or Linux.

4

5 Contents
RStudio, https://www.rstudio.com/: RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
Anaconda
Pandas Data Analysis Library: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

6

7 Python Cheat Sheet

8

9 RStudio RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. RStudio is available in two editions: RStudio Desktop, where the program is run locally as a regular desktop application; and RStudio Server, which allows accessing RStudio using a web browser while it is running on a remote Linux server. Prepackaged distributions of RStudio Desktop are available for Windows, OS X, and Linux.

10

11

12

13 Eclipse (software) Eclipse is an integrated development environment (IDE) used in computer programming, and is the most widely used Java IDE. It contains a base workspace and an extensible plug-in system for customizing the environment. Eclipse is written mostly in Java and its primary use is for developing Java applications, but it may also be used to develop applications in other programming languages.

14

15 http://app.finance.china.com.cn/stock/quote/history.php

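16 K-line (candlestick) chart A K-line is drawn from four values: the opening price, the highest price, the lowest price, and the closing price. Every K-line is built around these four values and reflects the overall market trend and price information.
Plotting each day's K-line on one sheet gives the daily K-line chart; weekly and monthly K-line charts can be drawn in the same way.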
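As an illustration of how a weekly K-line relates to daily ones, here is a minimal Python sketch. The `Bar` type, the `merge_bars` helper, and the prices are hypothetical and introduced only for this example; the rule it applies is the usual one (first open, highest high, lowest low, last close).

```python
from typing import NamedTuple, Sequence


class Bar(NamedTuple):
    """One K-line: the four prices it is drawn from (hypothetical type)."""
    open_price: float
    high: float
    low: float
    close_price: float


def merge_bars(bars: Sequence[Bar]) -> Bar:
    """Combine consecutive bars (oldest first) into one longer-period bar."""
    if not bars:
        raise ValueError("need at least one bar")
    return Bar(
        open_price=bars[0].open_price,     # open of the first day
        high=max(b.high for b in bars),    # highest high of the period
        low=min(b.low for b in bars),      # lowest low of the period
        close_price=bars[-1].close_price,  # close of the last day
    )


# Made-up prices for five trading days (not real market data).
week = [
    Bar(10.0, 10.5, 9.8, 10.2),
    Bar(10.2, 10.9, 10.1, 10.8),
    Bar(10.8, 11.2, 10.6, 10.7),
    Bar(10.7, 10.9, 10.3, 10.4),
    Bar(10.4, 11.1, 10.4, 11.0),
]
weekly = merge_bars(week)
print(weekly)
```

The same helper would give a monthly bar when passed a month of daily bars.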

17 2016/11/29

18 Data format

19 In statistical inference, a subset of the population (a statistical sample) is chosen to represent the population in a statistical analysis. If a sample is chosen properly, characteristics of the entire population that the sample is drawn from can be estimated from corresponding characteristics of the sample.
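A minimal sketch of the idea, using synthetic numbers rather than the stock data: a simple random sample is drawn from a made-up population and its mean is compared with the population mean. The population parameters, the sample size, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic population (not real data): 100,000 values around 100.
population = [rng.gauss(100.0, 15.0) for _ in range(100_000)]

# A simple random sample without replacement.
sample = rng.sample(population, 1_000)

pop_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print("population mean: {:.2f}".format(pop_mean))
print("sample mean:     {:.2f}".format(sample_mean))
```

With a properly drawn sample of this size the two means should be close, which is the sense in which the sample characteristic estimates the population characteristic.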

20 date closingPrice openingPrice riseFall highestPrice lowestPrice volume turnover 0.35 E7 0.37 -0.41 E7 ……

21 Statistical Data Analysis
names = c('date', 'closingPrice', 'openingPrice', 'riseFall', 'highestPrice', 'lowestPrice', 'volume', 'turnover')
szzs = read.table(' szzs_27.in', col.names=names)
View(szzs)

22 Statistical Data Analysis
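szzs[-1]            # every column except the first (date)
summary(szzs[-1])   # per-column minimum, quartiles, median, mean, and maximum
Labels on the output: First quartile, Second quartile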

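23 Quantile (1/3) In statistics and the theory of probability, quantiles are cutpoints dividing the range of a probability distribution into contiguous intervals with equal probabilities, or dividing the observations in a sample in the same way.
Quartiles: sort all values in ascending order and divide them into four equal parts; the values at the three cut points are the quartiles.
In a box plot, the upper and lower edges of the central box are usually the upper quartile (75%, Q3) and the lower quartile (25%, Q1) of the data set, and the line inside the box marks the median (50%, Q2).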

24 Statistical Data Analysis

25 Statistical Data Analysis
> nuclear <- c(7, 20, 16, 6, 58, 9, 20, 50, 23, 33, 8, 10, 15, 16, 104)
> quantile(nuclear)
  0%  25%  50%  75% 100%
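The numeric row of the R output is not preserved in this transcript. As a cross-check, here is a minimal Python sketch that computes the same five cut points with the standard library. It assumes that `method="inclusive"` in `statistics.quantiles` (Python 3.8+) matches the interpolation rule R uses by default.

```python
import statistics

nuclear = [7, 20, 16, 6, 58, 9, 20, 50, 23, 33, 8, 10, 15, 16, 104]

# Three cut points that split the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(nuclear, n=4, method="inclusive")

five_numbers = [min(nuclear), q1, q2, q3, max(nuclear)]
print(five_numbers)  # [6, 9.5, 16.0, 28.0, 104]
```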

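26 Quantile (1/3) For a population of discrete values, or for a continuous population density, the k-th q-quantile is the data value where the cumulative distribution function crosses k/q.
Estimating quantiles from a sample
Discrete uniform distribution. Background: if a random variable takes n distinct values, each with the same probability, it is said to follow a discrete uniform distribution. This usually arises when we do not know how likely each outcome is and treat all outcomes as equally likely, for example when rolling a die or tossing a coin.
Definition: let the discrete random variable X take the possible values 1, 2, …, n. If its probability function is f(x) = 1/n for x = 1, 2, …, n, then this probability distribution is called the discrete uniform distribution.
Because the quantiles are computed from a sample, linear interpolation is needed.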
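The linear interpolation can be written out as follows. The notation is introduced here for illustration: x_(0) ≤ … ≤ x_(n-1) are the sorted sample values (0-based), and p is the requested fraction between 0 and 1. This is one common convention and matches the percentile code on the next slide; other interpolation conventions exist.

```latex
k = (n-1)\,p, \qquad f = \lfloor k \rfloor, \qquad c = \lceil k \rceil

\hat{Q}(p) =
\begin{cases}
x_{(f)}, & f = c \\
x_{(f)}\,(c-k) + x_{(c)}\,(k-f), & f < c
\end{cases}

% Worked example: values 0, 1, \dots, 9 (n = 10), p = 0.25
k = 9 \times 0.25 = 2.25, \quad f = 2, \quad c = 3
\hat{Q}(0.25) = 2 \times 0.75 + 3 \times 0.25 = 2.25
```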

27
import math
import functools
def percentile(N, percent, key=lambda x: x):
    """
    Find the percentile of a list of values.
    @parameter N - is a list of values. Note N MUST BE already sorted.
    @parameter percent - a float value from 0.0 to 1.0.
    @parameter key - optional key function to compute value from each element of N.
    @return - the percentile of the values
    """
    if not N:
        return None
    # Fractional (0-based) position of the requested percentile.
    k = (len(N) - 1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    # Linear interpolation between the two neighbouring values.
    d0 = key(N[int(f)]) * (c - k)
    d1 = key(N[int(c)]) * (k - f)
    return d0 + d1
# median is 50th percentile.
median = functools.partial(percentile, percent=0.5)
Q1 = percentile(range(10), 0.25)
print(Q1)
Q3 = percentile(range(10), 0.75)
print(Q3)
Q2 = median(range(10))
print(Q2)
A_Q2 = median(range(11))
print(A_Q2)

28

29 Data analysis for szzs

30 Spyder (Python 3.5)

31

32

33 Statistical Data Analysis

34 Statistical Data Analysis

35 This was plotted with Python, in Anaconda's Spyder IDE. Reference: http://stackoverflow
#import statistics
#import matplotlib as mpl
from statistics import mean
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
# NOTE: `result` is not defined in this transcript; it is assumed to be a
# list of rows loaded on an earlier slide. percentile() is the function from slide 27.
fig = plt.figure()
#### summary( data)
#print('\tV2\tV3\tV4\tV5\tV6\tV7\tV8')
cnt = 0
for row in result[1:]:
    cnt += 1
    nrow = [float(i) for i in row]
    #print(nrow)
    sort_num = sorted(nrow)
    # Kernel density estimate with a fixed bandwidth factor.
    density = gaussian_kde(sort_num)
    xs = np.linspace(0, 8, 200)
    density.covariance_factor = lambda: .25
    density._compute_covariance()
    ax = fig.add_subplot(2, 1, cnt)
    ax.plot(sort_num, density(sort_num))
    if cnt == 2:
        break
    print('Min.\t:{0:8.2f}\t1st Qu.\t:{1:8.2f}\tMedian={2:8.2f}\tMean\t:{3:8.2f}\t3rd Qu.\t:{4:8.2f}\tMax\t:{5:8.2f}'
          .format(min(sort_num), percentile(sort_num, 0.25), percentile(sort_num, 0.50),
                  mean(sort_num), percentile(sort_num, 0.75), max(sort_num)))
    #print( 'StatMean\t:{:8.4f}\t'.format( statistics.mean(sort_num) ), end='')
plt.show()

36 Pandas Data Analysis Library
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

37
# -*- coding: utf-8 -*-
import pandas as pd
# Reading data locally
#df = pd.read_csv('/Users/al-ahmadgaidasaad/Documents/d.csv')
# Reading data from web
data_url = "..."  # the URL is not preserved in this transcript
#df = pd.read_csv(data_url)
names = ['date', 'closingPrice', 'openingPrice', 'riseFall', 'highestPrice',
         'lowestPrice', 'volume', 'turnover']
#df = pd.read_table(data_url, header=None)
df = pd.read_table(data_url, names=names)
# Head of the data
print(df.head())
print(df.describe())
# Import the module for plotting
import matplotlib.pyplot as plt
# to drop by column number instead of by column label,
# where 1 is the axis number (0 for rows and 1 for columns):
#df_f = df.drop(df.columns[[0, 3, 6, 7]], axis=1)
df_show = df.drop(['date', 'riseFall', 'volume', 'turnover'], axis=1)
# Box plot of the remaining price columns.
df_show.plot(kind='box')
plt.show()

