gimi9 Pandas Profiling

Dataset statistics

Number of variables	9
Number of observations	25
Missing cells	113
Missing cells (%)	50.2%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	1.9 KiB
Average record size in memory	78.1 B
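
The headline missing-cell figures can be cross-checked against the per-column counts listed under Alerts. The following is a small arithmetic sketch; the variable names are illustrative and not part of the report.

```python
# Figures copied from "Dataset statistics" above.
n_variables = 9
n_observations = 25
missing_cells = 113

# Per-column missing counts as listed under "Alerts"
# (테이블정의서, then Unnamed: 1 through Unnamed: 8).
missing_per_column = [1, 6, 3, 5, 5, 25, 22, 22, 24]

total_cells = n_variables * n_observations        # 9 * 25 = 225
missing_pct = 100 * missing_cells / total_cells   # share of empty cells

print(sum(missing_per_column))   # total of the per-column counts
print(f"{missing_pct:.1f}%")     # percentage, one decimal place
```

The per-column counts sum to the reported 113 missing cells, and 113 of 225 cells rounds to the reported 50.2%.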

Variable types

Unsupported	4
Categorical	5

Dataset

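Description	Land characteristics for officially announced land prices (2022), compiled to calculate individual officially announced land prices under the Act on the Public Announcement of Real Estate Values
Author	Ministry of Land, Infrastructure and Transport (국토교통부)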
URL	http://data.nsdi.go.kr/dataset/20220727ds00002

Alerts

`Unnamed: 8` has constant value "참조테이블명/비고"	Constant
`Unnamed: 6` is highly correlated with `Unnamed: 2` and 1 other fields	High correlation
`Unnamed: 2` is highly correlated with `Unnamed: 6` and 2 other fields	High correlation
`Unnamed: 1` is highly correlated with `Unnamed: 2` and 1 other fields	High correlation
`Unnamed: 3` is highly correlated with `Unnamed: 6` and 2 other fields	High correlation
`테이블정의서` has 1 (4.0%) missing values	Missing
`Unnamed: 1` has 6 (24.0%) missing values	Missing
`Unnamed: 2` has 3 (12.0%) missing values	Missing
`Unnamed: 3` has 5 (20.0%) missing values	Missing
`Unnamed: 4` has 5 (20.0%) missing values	Missing
`Unnamed: 5` has 25 (100.0%) missing values	Missing
`Unnamed: 6` has 22 (88.0%) missing values	Missing
`Unnamed: 7` has 22 (88.0%) missing values	Missing
`Unnamed: 8` has 24 (96.0%) missing values	Missing
`테이블정의서` is an unsupported type, check if it needs cleaning or further analysis	Unsupported
`Unnamed: 4` is an unsupported type, check if it needs cleaning or further analysis	Unsupported
`Unnamed: 5` is an unsupported type, check if it needs cleaning or further analysis	Unsupported
`Unnamed: 7` is an unsupported type, check if it needs cleaning or further analysis	Unsupported

Reproduction

Analysis started	2022-08-14 13:01:46.075690
Analysis finished	2022-08-14 13:01:48.920493
Duration	2.84 seconds
Software version	pandas-profiling v3.2.0
Download configuration	config.json
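
A report like this one is typically produced with a call along the following lines. This is a minimal sketch assuming pandas-profiling v3.2.0 is installed; the helper name `build_profile` and the file names in the usage comment are hypothetical, and the exact options used for this report are not known (they live in the downloadable config.json).

```python
def build_profile(df, title="Pandas Profiling Report"):
    """Return a ProfileReport for the given DataFrame.

    The import is done lazily so this sketch can be loaded even where
    pandas-profiling is not installed.
    """
    from pandas_profiling import ProfileReport

    return ProfileReport(df, title=title)


# Usage (input and output file names are hypothetical):
#   import pandas as pd
#   df = pd.read_excel("table_definition.xlsx")
#   build_profile(df, title="gimi9 Pandas Profiling").to_file("report.html")
```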

테이블정의서
Unsupported

MISSING
REJECTED
UNSUPPORTED

Missing	1
Missing (%)	4.0%
Memory size	328.0 B

Unnamed: 1
Categorical

HIGH CORRELATION
MISSING

Distinct	19
Distinct (%)	100.0%
Missing	6
Missing (%)	24.0%
Memory size	328.0 B

컬럼ID	1
STDMT	1
PNU	1
LAND_SEQNO	1
SGG_CD	1
Other values (14)	14

Length

Max length	11
Median length	9
Mean length	6.368421053
Min length	3

Unique

Unique	19
Unique (%)	100.0%

Sample

1st row	컬럼ID
2nd row	STDMT
3rd row	PNU
4th row	LAND_SEQNO
5th row	SGG_CD

Common Values

Value	Count	Frequency (%)
컬럼ID	1	4.0%
STDMT	1	4.0%
PNU	1	4.0%
LAND_SEQNO	1	4.0%
SGG_CD	1	4.0%
LAND_LOC_CD	1	4.0%
LAND_GBN	1	4.0%
BOBN	1	4.0%
BUBN	1	4.0%
ADM_UMD_CD	1	4.0%
Other values (9)	9	36.0%
(Missing)	6	24.0%

Length

Histogram of lengths of the category

Value	Count	Frequency (%)
컬럼id	1	5.3%
pnilp	1	5.3%
geo_form	1	5.3%
geo_hl	1	5.3%
land_use	1	5.3%
spfc2	1	5.3%
spfc1	1	5.3%
parea	1	5.3%
jimok	1	5.3%
adm_umd_cd	1	5.3%
Other values (9)	9	47.4%

Unnamed: 2
Categorical

HIGH CORRELATION
MISSING

Distinct	22
Distinct (%)	100.0%
Missing	3
Missing (%)	12.0%
Memory size	328.0 B

부번	1
공시지가 토지특성	1
컬럼명	1
기준월	1
필지고유번호	1
Other values (17)	17

Length

Max length	20
Median length	7
Mean length	5.272727273
Min length	2

Unique

Unique	22
Unique (%)	100.0%

Sample

1st row	부동산 제공 표준 데이터셋 v1.83
2nd row	공시지가 토지특성
3rd row	컬럼명
4th row	기준월
5th row	필지고유번호

Common Values

Value	Count	Frequency (%)
부번	1	4.0%
공시지가 토지특성	1	4.0%
컬럼명	1	4.0%
기준월	1	4.0%
필지고유번호	1	4.0%
토지일련번호	1	4.0%
시군구코드	1	4.0%
토지소재지코드	1	4.0%
토지구분	1	4.0%
본번	1	4.0%
Other values (12)	12	48.0%
(Missing)	3	12.0%

Length

Histogram of lengths of the category

Value	Count	Frequency (%)
부번	1	3.7%
표준	1	3.7%
행정읍면동코드	1	3.7%
도로접면	1	3.7%
지형형상	1	3.7%
지형고저	1	3.7%
토지이용상황	1	3.7%
용도지역2	1	3.7%
용도지역1	1	3.7%
면적	1	3.7%
Other values (17)	17	63.0%

Unnamed: 3
Categorical

HIGH CORRELATION
MISSING

Distinct	5
Distinct (%)	25.0%
Missing	5
Missing (%)	20.0%
Memory size	328.0 B

CHAR	12
VARCHAR2	3
NUMBER	3
테이블ID	1
타입	1

Length

Max length	8
Median length	4
Mean length	4.85
Min length	2

Unique

Unique	2
Unique (%)	10.0%

Sample

1st row	테이블ID
2nd row	타입
3rd row	CHAR
4th row	VARCHAR2
5th row	NUMBER

Common Values

Value	Count	Frequency (%)
CHAR	12	48.0%
VARCHAR2	3	12.0%
NUMBER	3	12.0%
테이블ID	1	4.0%
타입	1	4.0%
(Missing)	5	20.0%

Length

Histogram of lengths of the category

Category Frequency Plot

Value	Count	Frequency (%)
char	12	60.0%
varchar2	3	15.0%
number	3	15.0%
테이블id	1	5.0%
타입	1	5.0%

Unnamed: 4
Unsupported

MISSING
REJECTED
UNSUPPORTED

Missing	5
Missing (%)	20.0%
Memory size	328.0 B

Unnamed: 5
Unsupported

MISSING
REJECTED
UNSUPPORTED

Missing	25
Missing (%)	100.0%
Memory size	353.0 B

Unnamed: 6
Categorical

HIGH CORRELATION
MISSING

Distinct	3
Distinct (%)	100.0%
Missing	22
Missing (%)	88.0%
Memory size	328.0 B

작성일	1
테이블명	1
PK/FK	1

Length

Max length	5
Median length	4
Mean length	4
Min length	3

Unique

Unique	3
Unique (%)	100.0%

Sample

1st row	작성일
2nd row	테이블명
3rd row	PK/FK

Common Values

Value	Count	Frequency (%)
작성일	1	4.0%
테이블명	1	4.0%
PK/FK	1	4.0%
(Missing)	22	88.0%

Length

Histogram of lengths of the category

Category Frequency Plot

Value	Count	Frequency (%)
작성일	1	33.3%
테이블명	1	33.3%
pk/fk	1	33.3%

Unnamed: 7
Unsupported

MISSING
REJECTED
UNSUPPORTED

Missing	22
Missing (%)	88.0%
Memory size	328.0 B

Unnamed: 8
Categorical

CONSTANT
MISSING
REJECTED

Distinct	1
Distinct (%)	100.0%
Missing	24
Missing (%)	96.0%
Memory size	328.0 B

참조테이블명/비고	1

Length

Max length	9
Median length	9
Mean length	9
Min length	9

Unique

Unique	1
Unique (%)	100.0%

Sample

1st row	참조테이블명/비고

Common Values

Value	Count	Frequency (%)
참조테이블명/비고	1	4.0%
(Missing)	24	96.0%

Length

Histogram of lengths of the category

Category Frequency Plot

Value	Count	Frequency (%)
참조테이블명/비고	1	100.0%

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, so for a perfectly linear relationship the steepness of the line (its angle to the x-axis) does not affect the magnitude of r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
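
The calculation described above can be sketched in plain Python. The helper name `pearson_r` and the sample data are illustrative only; this is not necessarily the routine the report itself calls.

```python
import math


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors in the covariance and in the two standard deviations
    # cancel, so plain sums of products are sufficient.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson_r(x, y)
# Shifting y and rescaling it by a positive factor leaves r unchanged.
r_shifted = pearson_r(x, [3.0 * v + 7.0 for v in y])
```

Here `r` and `r_shifted` agree up to floating-point error, which illustrates the invariance under changes in location and scale.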

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
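
As a sketch of this definition, the following plain-Python helpers rank each variable (ties receive the average of their positions, which is an assumption about tie handling) and then apply the covariance-over-standard-deviations formula to the ranks. The function names are illustrative.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Covariance of the rank variables over the product of their std deviations."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# A nonlinear but strictly increasing relationship (y = x cubed).
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
rho = spearman_rho(x, y)
```

Because the relationship is strictly increasing, the ranks of `x` and `y` coincide and `rho` comes out as 1 up to floating-point error, even though the relationship is not linear.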

Kendall's τ

Similar to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
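
The pair-counting rule can be sketched as follows. This implements the formula exactly as stated (often called tau-a, with no correction for ties); statistical libraries may default to a tie-corrected variant, so results can differ on data with ties. The function name is illustrative.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both variables move in the same direction
        elif sign < 0:
            discordant += 1   # the variables move in opposite directions
        # sign == 0 is a tie in x or y and counts as neither
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Six pairs in total: five concordant, one discordant (the 2nd and 3rd points).
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
```

For this sample, `tau` equals (5 - 1) / 6.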

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available in the phik package's documentation.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
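
A plain-Python sketch of a bias-corrected Cramér's V in the form given by Bergsma (2013) is shown below. The function name and the sample data are illustrative, and the code is not taken from the report's own implementation.

```python
import math
from collections import Counter


def cramers_v_corrected(x, y):
    """Bias-corrected Cramér's V for two equally long sequences of labels."""
    n = len(x)
    joint = Counter(zip(x, y))
    row_totals = Counter(x)
    col_totals = Counter(y)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic over every cell of the contingency table.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    phi2 = chi2 / n
    # Bias corrections applied to phi squared and to the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    if denom <= 0:
        return float("nan")  # undefined when a variable has a single category
    return math.sqrt(phi2_corr / denom)


# Perfect association: each label of x maps to exactly one label of y.
v_perfect = cramers_v_corrected(["a"] * 10 + ["b"] * 10, ["u"] * 10 + ["v"] * 10)
# Independence: every combination of labels occurs equally often.
v_independent = cramers_v_corrected(["a", "a", "b", "b"] * 5, ["u", "v", "u", "v"] * 5)
```

The first call yields a value of 1 up to floating-point error and the second yields 0, matching the two ends of the range described above.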

A simple visualization of nullity by column.

The nullity matrix is a data-dense display that lets you quickly pick out visual patterns in data completion.

The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.

The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap.
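
One way to approximate the nullity correlation described above is to correlate the missing-value indicators of each column with pandas. This is a sketch on made-up data with hypothetical column names `a`, `b` and `c`; it is not necessarily how the report's heatmap is computed.

```python
import pandas as pd

# Toy frame: "b" is missing exactly where "c" is present, and vice versa.
df = pd.DataFrame(
    {
        "a": [1, None, 3, None, 5, 6],
        "b": [1, None, 3, None, 5, None],
        "c": [None, 2, None, 4, None, 6],
    }
)

# 1 where a cell is missing, 0 where it is present.
null_indicator = df.isna().astype(int)

# Pearson correlation between the indicator columns.
nullity_corr = null_indicator.corr()
print(nullity_corr.round(2))
```

A value near -1 (as for `b` and `c` here) means one column tends to be filled exactly when the other is empty; a value near +1 means the two columns tend to be missing together.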

First rows

	테이블정의서	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8
0	작성자	<NA>	부동산 제공 표준 데이터셋 v1.83	<NA>	NaN	<NA>	작성일	2017	<NA>
1	주제영역명	<NA>	<NA>	테이블ID	APMM_NV_LAND_OPEN	<NA>	테이블명	공시지가 토지특성	<NA>
2	테이블설명	<NA>	공시지가 토지특성	<NA>	NaN	<NA>	<NA>	NaN	<NA>
3	No	컬럼ID	컬럼명	타입	길이(Byte)	<NA>	PK/FK	Default	참조테이블명/비고
4	1	STDMT	기준월	CHAR	2	<NA>	<NA>	NaN	<NA>
5	2	PNU	필지고유번호	VARCHAR2	19	<NA>	<NA>	NaN	<NA>
6	3	LAND_SEQNO	토지일련번호	NUMBER	6	<NA>	<NA>	NaN	<NA>
7	4	SGG_CD	시군구코드	CHAR	5	<NA>	<NA>	NaN	<NA>
8	5	LAND_LOC_CD	토지소재지코드	CHAR	5	<NA>	<NA>	NaN	<NA>
9	6	LAND_GBN	토지구분	CHAR	1	<NA>	<NA>	NaN	<NA>

Last rows

	테이블정의서	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8
15	12	PAREA	면적	NUMBER	17,5	<NA>	<NA>	NaN	<NA>
16	13	SPFC1	용도지역1	CHAR	2	<NA>	<NA>	NaN	<NA>
17	14	SPFC2	용도지역2	CHAR	2	<NA>	<NA>	NaN	<NA>
18	15	LAND_USE	토지이용상황	VARCHAR2	3	<NA>	<NA>	NaN	<NA>
19	16	GEO_HL	지형고저	CHAR	2	<NA>	<NA>	NaN	<NA>
20	17	GEO_FORM	지형형상	CHAR	2	<NA>	<NA>	NaN	<NA>
21	18	ROAD_SIDE	도로접면	CHAR	2	<NA>	<NA>	NaN	<NA>
22	인덱스명	<NA>	인덱스키	<NA>	NaN	<NA>	<NA>	NaN	<NA>
23	NaN	<NA>	<NA>	<NA>	NaN	<NA>	<NA>	NaN	<NA>
24	업무규칙	<NA>	<NA>	<NA>	NaN	<NA>	<NA>	NaN	<NA>
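
The sample rows suggest that the sheet is a table-definition form whose real column header (No, 컬럼ID, 컬럼명, 타입, ...) sits in row 3, which would explain the `Unnamed: N` column names and the high missing-value counts. A hedged sketch of promoting that row to the header with pandas follows; the helper name `promote_header` is hypothetical and the frame contains only a few rows and columns copied from the samples above.

```python
import pandas as pd

# A few rows and the first five columns from the samples above.
raw = pd.DataFrame(
    [
        ["작성자", None, "부동산 제공 표준 데이터셋 v1.83", None, None],
        ["No", "컬럼ID", "컬럼명", "타입", "길이(Byte)"],
        ["1", "STDMT", "기준월", "CHAR", "2"],
        ["2", "PNU", "필지고유번호", "VARCHAR2", "19"],
        ["인덱스명", None, "인덱스키", None, None],
    ],
    columns=["테이블정의서", "Unnamed: 1", "Unnamed: 2", "Unnamed: 3", "Unnamed: 4"],
)


def promote_header(df, marker="No", id_column="컬럼ID"):
    """Use the row whose first cell equals `marker` as the header.

    Assumes the default 0..n-1 integer index, as in the samples above.
    """
    first_col = df.columns[0]
    header_pos = df.index[df[first_col] == marker][0]
    body = df.loc[header_pos + 1:].copy()
    body.columns = df.loc[header_pos].tolist()
    # Keep only real column-definition rows (those with a column ID).
    body = body[body[id_column].notna()]
    return body.reset_index(drop=True)


columns = promote_header(raw)
print(columns)
```

Profiling the cleaned frame rather than the raw sheet would likely give more meaningful per-column statistics, since each column would then hold a single kind of value.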
