gimi9 Pandas Profiling

Dataset statistics

Number of variables	6
Number of observations	10000
Missing cells	4088
Missing cells (%)	6.8%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	576.2 KiB
Average record size in memory	59.0 B
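
The derived figures in this table follow from the raw counts. A minimal sketch of the arithmetic, assuming the missing-cell percentage is taken over all cells (variables × observations) and that 1 KiB = 1024 B:

```python
# Values copied from the Dataset statistics table.
n_variables = 6
n_observations = 10_000
missing_cells = 4_088
total_size_kib = 576.2

# Assumption: the percentage is taken over every cell in the table.
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells

# Assumption: 1 KiB = 1024 bytes.
avg_record_bytes = total_size_kib * 1024 / n_observations

print(f"Missing cells (%): {missing_pct:.1f}%")
print(f"Average record size: {avg_record_bytes:.1f} B")
```

Under these assumptions both values round to the figures reported in the table.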

Variable types

Text	3
Categorical	2
Numeric	1

Dataset

Description	File download
Author	Seoul Metropolitan Government (서울특별시)
URL	https://data.seoul.go.kr/dataList/OA-15749/S/1/datasetView.do

Alerts

`작업_일자"` has constant value "20111227"	Constant
`층_구분_코드` is highly imbalanced (80.5%)	Imbalance
`동명칭` has 3940 (39.4%) missing values	Missing
`호_명` has 148 (1.5%) missing values	Missing
`관리_폐쇄말소대장_PK` has unique values	Unique
`층_번호` has 4372 (43.7%) zeros	Zeros
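
The report does not say how the 80.5% imbalance figure is computed. One reading that reproduces it is one minus the Shannon entropy of the class frequencies, normalised by the maximum entropy for that number of classes. The sketch below is written under that assumption; the function name `imbalance_score` is hypothetical, and the counts are the two `층_구분_코드` class counts reported later in this document.

```python
import math


def imbalance_score(counts):
    """Return 1 - (Shannon entropy / maximum entropy) for a list of class counts."""
    total = sum(counts)
    n_classes = len(counts)
    if n_classes < 2:
        # Convention chosen for this sketch: a single class has no balance to measure.
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return 1 - entropy / math.log2(n_classes)


# Class counts for 층_구분_코드: 9698 rows with "20", 302 rows with "10".
score = imbalance_score([9698, 302])
print(f"{score:.1%}")
```

A score of 0 would mean perfectly balanced classes and a score near 1 a column dominated by one value.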

Reproduction

Analysis started	2024-05-11 05:37:01.089673
Analysis finished	2024-05-11 05:37:02.771217
Duration	1.68 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json
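
The duration is the difference between the two timestamps. A small sketch using only the standard library, with the timestamps copied from the table:

```python
from datetime import datetime

# Timestamps copied from the Reproduction table.
started = datetime.fromisoformat("2024-05-11 05:37:01.089673")
finished = datetime.fromisoformat("2024-05-11 05:37:02.771217")

duration = (finished - started).total_seconds()
print(f"{duration:.2f} seconds")
```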

관리_폐쇄말소대장_PK
Text

UNIQUE

Distinct	10000
Distinct (%)	100.0%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

Length

Max length	15
Median length	11
Mean length	11.1824
Min length	10

Characters and Unicode

Total characters	111824
Distinct characters	11
Distinct categories	2
Distinct scripts	1
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
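
As an illustration of those character properties, Python's standard `unicodedata` module exposes the general category of each code point. The sketch below classifies the five sample values reported for this column; the mapping from two-letter category codes to the names used in this report covers only the two categories that occur here and is written by hand for the example.

```python
import unicodedata
from collections import Counter

# Sample values copied from the sample rows reported for this column.
samples = ["11620-12907", "11470-13534", "11620-20385", "11500-23424", "11650-15591"]

# Hand-written mapping for the two general categories that appear in these values.
CATEGORY_NAMES = {"Nd": "Decimal Number", "Pd": "Dash Punctuation"}

category_counts = Counter(
    CATEGORY_NAMES.get(unicodedata.category(ch), unicodedata.category(ch))
    for value in samples
    for ch in value
)
print(category_counts)
```

Only digits and the hyphen occur, which is consistent with the two distinct categories and eleven distinct characters reported for the column.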

Unique

Unique	10000
Unique (%)	100.0%

Sample

1st row	11620-12907
2nd row	11470-13534
3rd row	11620-20385
4th row	11500-23424
5th row	11650-15591

Value	Count	Frequency (%)
11620-12907	1	< 0.1%
11500-10544	1	< 0.1%
11680-14238	1	< 0.1%
11470-12803	1	< 0.1%
11530-6202	1	< 0.1%
11470-15001	1	< 0.1%
11650-11279	1	< 0.1%
11500-19733	1	< 0.1%
11620-19913	1	< 0.1%
11560-100009255	1	< 0.1%
Other values (9990)	9990	99.9%
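
The ten listed keys all look like a five-digit prefix, a hyphen, and a numeric suffix of varying length. That pattern is inferred from these values only, not from any dataset documentation, so the regular expression below is a hypothetical check rather than a specification of the key format.

```python
import re

# Values copied from the frequency table above.
values = [
    "11620-12907", "11500-10544", "11680-14238", "11470-12803", "11530-6202",
    "11470-15001", "11650-11279", "11500-19733", "11620-19913", "11560-100009255",
]

# Hypothetical pattern inferred from these ten values only.
PK_PATTERN = re.compile(r"(\d{5})-(\d+)")

parsed = [PK_PATTERN.fullmatch(v) for v in values]
all_match = all(m is not None for m in parsed)

# Distinct string lengths seen among the listed values.
lengths = sorted({len(v) for v in values})
print(all_match, lengths)
```

Running the same check over the full column would show whether any of the remaining 9990 keys deviates from the pattern.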

Most occurring characters

Value	Count	Frequency (%)
1	31111	27.8%
0	18068	16.2%
5	10457	9.4%
-	10000	8.9%
6	7611	6.8%
4	7398	6.6%
2	7013	6.3%
9	5241	4.7%
7	5233	4.7%
3	4879	4.4%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	101824	91.1%
Dash Punctuation	10000	8.9%

Most frequent character per category

Decimal Number

Value	Count	Frequency (%)
1	31111	30.6%
0	18068	17.7%
5	10457	10.3%
6	7611	7.5%
4	7398	7.3%
2	7013	6.9%
9	5241	5.1%
7	5233	5.1%
3	4879	4.8%
8	4813	4.7%

Dash Punctuation

Value	Count	Frequency (%)
-	10000	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	111824	100.0%

Most frequent character per script

Common

Value	Count	Frequency (%)
1	31111	27.8%
0	18068	16.2%
5	10457	9.4%
-	10000	8.9%
6	7611	6.8%
4	7398	6.6%
2	7013	6.3%
9	5241	4.7%
7	5233	4.7%
3	4879	4.4%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	111824	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
1	31111	27.8%
0	18068	16.2%
5	10457	9.4%
-	10000	8.9%
6	7611	6.8%
4	7398	6.6%
2	7013	6.3%
9	5241	4.7%
7	5233	4.7%
3	4879	4.4%

동명칭
Text

MISSING

Distinct	468
Distinct (%)	7.7%
Missing	3940
Missing (%)	39.4%
Memory size	156.2 KiB

Length

Max length	18
Median length	4
Mean length	3.8858086
Min length	1

Characters and Unicode

Total characters	23548
Distinct characters	263
Distinct categories	8
Distinct scripts	3
Distinct blocks	2

Unique

Unique	130
Unique (%)	2.1%
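
For this column the percentages and the mean length appear to be computed over the non-missing rows rather than over all 10000 observations. The sketch below checks that reading against the reported numbers; it is an inference from the figures, not a statement taken from the report.

```python
# Values copied from the 동명칭 tables.
n_observations = 10_000
missing = 3_940
distinct = 468
unique = 130
total_characters = 23_548

# Assumption: the denominators are the rows that actually hold a value.
non_missing = n_observations - missing

distinct_pct = 100 * distinct / non_missing
unique_pct = 100 * unique / non_missing
mean_length = total_characters / non_missing

print(f"Distinct (%): {distinct_pct:.1f}%")
print(f"Unique (%): {unique_pct:.1f}%")
print(f"Mean length: {mean_length:.7f}")
```

With all 10000 rows as the denominator the distinct percentage would be 4.7%, which does not match the table.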

Sample

1st row	109동
2nd row	가동
3rd row	5동
4th row	223동
5th row	1동

Value	Count	Frequency (%)
101동	412	6.5%
102동	276	4.4%
1동	169	2.7%
201동	156	2.5%
2동	155	2.4%
가동	154	2.4%
103동	133	2.1%
여의도자이	114	1.8%
105동	98	1.5%
나동	92	1.5%
Other values (478)	4585	72.3%

Most occurring characters

Value	Count	Frequency (%)
동	5183	22.0%
1	3654	15.5%
0	2280	9.7%
2	1560	6.6%
3	1287	5.5%
4	596	2.5%
6	576	2.4%
5	524	2.2%
7	414	1.8%
8	372	1.6%
Other values (253)	7102	30.2%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	11525	48.9%
Other Letter	11515	48.9%
Space Separator	284	1.2%
Uppercase Letter	164	0.7%
Close Punctuation	20	0.1%
Open Punctuation	20	0.1%
Other Punctuation	14	0.1%
Dash Punctuation	6	< 0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
동	5183	45.0%
가	300	2.6%
이	236	2.0%
스	225	2.0%
상	205	1.8%
트	158	1.4%
파	150	1.3%
산	149	1.3%
아	140	1.2%
도	139	1.2%
Other values (227)	4630	40.2%

Decimal Number

Value	Count	Frequency (%)
1	3654	31.7%
0	2280	19.8%
2	1560	13.5%
3	1287	11.2%
4	596	5.2%
6	576	5.0%
5	524	4.5%
7	414	3.6%
8	372	3.2%
9	262	2.3%

Uppercase Letter

Value	Count	Frequency (%)
A	54	32.9%
T	32	19.5%
V	32	19.5%
B	32	19.5%
C	5	3.0%
D	3	1.8%
G	2	1.2%
S	2	1.2%
E	2	1.2%

Other Punctuation

Value	Count	Frequency (%)
.	12	85.7%
,	1	7.1%
*	1	7.1%

Space Separator

Value	Count	Frequency (%)
(space)	284	100.0%

Close Punctuation

Value	Count	Frequency (%)
)	20	100.0%

Open Punctuation

Value	Count	Frequency (%)
(	20	100.0%

Dash Punctuation

Value	Count	Frequency (%)
-	6	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	11869	50.4%
Hangul	11515	48.9%
Latin	164	0.7%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
동	5183	45.0%
가	300	2.6%
이	236	2.0%
스	225	2.0%
상	205	1.8%
트	158	1.4%
파	150	1.3%
산	149	1.3%
아	140	1.2%
도	139	1.2%
Other values (227)	4630	40.2%

Common

Value	Count	Frequency (%)
1	3654	30.8%
0	2280	19.2%
2	1560	13.1%
3	1287	10.8%
4	596	5.0%
6	576	4.9%
5	524	4.4%
7	414	3.5%
8	372	3.1%
(space)	284	2.4%
Other values (7)	322	2.7%

Latin

Value	Count	Frequency (%)
A	54	32.9%
T	32	19.5%
V	32	19.5%
B	32	19.5%
C	5	3.0%
D	3	1.8%
G	2	1.2%
S	2	1.2%
E	2	1.2%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	12033	51.1%
Hangul	11515	48.9%

Most frequent character per block

Hangul

Value	Count	Frequency (%)
동	5183	45.0%
가	300	2.6%
이	236	2.0%
스	225	2.0%
상	205	1.8%
트	158	1.4%
파	150	1.3%
산	149	1.3%
아	140	1.2%
도	139	1.2%
Other values (227)	4630	40.2%

ASCII

Value	Count	Frequency (%)
1	3654	30.4%
0	2280	18.9%
2	1560	13.0%
3	1287	10.7%
4	596	5.0%
6	576	4.8%
5	524	4.4%
7	414	3.4%
8	372	3.1%
(space)	284	2.4%
Other values (16)	486	4.0%

호_명
Text

MISSING

Distinct	2011
Distinct (%)	20.4%
Missing	148
Missing (%)	1.5%
Memory size	156.2 KiB

Length

Max length	14
Median length	4
Mean length	4.1098254
Min length	1

Characters and Unicode

Total characters	40490
Distinct characters	78
Distinct categories	9
Distinct scripts	3
Distinct blocks	2

Unique

Unique	1238
Unique (%)	12.6%

Sample

1st row	802
2nd row	303호
3rd row	B01호
4th row	312
5th row	306호

Value	Count	Frequency (%)
101호	262	2.6%
201호	257	2.6%
202호	199	2.0%
102호	183	1.8%
302호	148	1.5%
301호	142	1.4%
103호	134	1.3%
203호	125	1.3%
303호	112	1.1%
201	104	1.0%
Other values (1957)	8260	83.2%

Most occurring characters

Value	Count	Frequency (%)
0	8646	21.4%
1	7163	17.7%
호	5913	14.6%
2	4721	11.7%
3	3100	7.7%
4	2007	5.0%
5	1859	4.6%
층	1378	3.4%
6	1238	3.1%
7	909	2.2%
Other values (68)	3556	8.8%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	31100	76.8%
Other Letter	8523	21.0%
Dash Punctuation	525	1.3%
Uppercase Letter	236	0.6%
Space Separator	74	0.2%
Close Punctuation	13	< 0.1%
Open Punctuation	13	< 0.1%
Lowercase Letter	4	< 0.1%
Other Punctuation	2	< 0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
호	5913	69.4%
층	1378	16.2%
지	419	4.9%
제	195	2.3%
하	106	1.2%
이	74	0.9%
가	73	0.9%
비	60	0.7%
상	51	0.6%
에	49	0.6%
Other values (40)	205	2.4%

Decimal Number

Value	Count	Frequency (%)
0	8646	27.8%
1	7163	23.0%
2	4721	15.2%
3	3100	10.0%
4	2007	6.5%
5	1859	6.0%
6	1238	4.0%
7	909	2.9%
8	795	2.6%
9	662	2.1%

Uppercase Letter

Value	Count	Frequency (%)
B	201	85.2%
A	14	5.9%
D	10	4.2%
C	4	1.7%
O	2	0.8%
E	2	0.8%
L	1	0.4%
S	1	0.4%
F	1	0.4%

Lowercase Letter

Value	Count	Frequency (%)
a	2	50.0%
b	1	25.0%
c	1	25.0%

Other Punctuation

Value	Count	Frequency (%)
*	1	50.0%
/	1	50.0%

Dash Punctuation

Value	Count	Frequency (%)
-	525	100.0%

Space Separator

Value	Count	Frequency (%)
(space)	74	100.0%

Close Punctuation

Value	Count	Frequency (%)
)	13	100.0%

Open Punctuation

Value	Count	Frequency (%)
(	13	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	31727	78.4%
Hangul	8523	21.0%
Latin	240	0.6%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
호	5913	69.4%
층	1378	16.2%
지	419	4.9%
제	195	2.3%
하	106	1.2%
이	74	0.9%
가	73	0.9%
비	60	0.7%
상	51	0.6%
에	49	0.6%
Other values (40)	205	2.4%

Common

Value	Count	Frequency (%)
0	8646	27.3%
1	7163	22.6%
2	4721	14.9%
3	3100	9.8%
4	2007	6.3%
5	1859	5.9%
6	1238	3.9%
7	909	2.9%
8	795	2.5%
9	662	2.1%
Other values (6)	627	2.0%

Latin

Value	Count	Frequency (%)
B	201	83.8%
A	14	5.8%
D	10	4.2%
C	4	1.7%
a	2	0.8%
O	2	0.8%
E	2	0.8%
L	1	0.4%
S	1	0.4%
b	1	0.4%
Other values (2)	2	0.8%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	31967	79.0%
Hangul	8523	21.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
0	8646	27.0%
1	7163	22.4%
2	4721	14.8%
3	3100	9.7%
4	2007	6.3%
5	1859	5.8%
6	1238	3.9%
7	909	2.8%
8	795	2.5%
9	662	2.1%
Other values (18)	867	2.7%

Hangul

Value	Count	Frequency (%)
호	5913	69.4%
층	1378	16.2%
지	419	4.9%
제	195	2.3%
하	106	1.2%
이	74	0.9%
가	73	0.9%
비	60	0.7%
상	51	0.6%
에	49	0.6%
Other values (40)	205	2.4%

층_구분_코드
Categorical

IMBALANCE

Distinct	2
Distinct (%)	< 0.1%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

20	9698
10	302

Length

Max length	2
Median length	2
Mean length	2
Min length	2

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	20
2nd row	20
3rd row	20
4th row	20
5th row	20

Common Values

Value	Count	Frequency (%)
20	9698	97.0%
10	302	3.0%

Length

[Plot: histogram of lengths of the category]

Common Values (Plot)

[Plot: bar chart of the common values listed above]

층_번호
Real number (ℝ)

ZEROS

Distinct	40
Distinct (%)	0.4%
Missing	0
Missing (%)	0.0%
Infinite	0
Infinite (%)	0.0%
Mean	3.8178

Minimum	0
Maximum	39
Zeros	4372
Zeros (%)	43.7%
Negative	0
Negative (%)	0.0%
Memory size	166.0 KiB

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
median	1
Q3	5
95-th percentile	16
Maximum	39
Range	39
Interquartile range (IQR)	5

Descriptive statistics

Standard deviation	5.6528222
Coefficient of variation (CV)	1.4806491
Kurtosis	3.721426
Mean	3.8178
Median Absolute Deviation (MAD)	1
Skewness	1.8930508
Sum	38178
Variance	31.954399
Monotonicity	Not monotonic
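
Several of these statistics are linked by simple identities: the mean is the sum divided by the number of observations, the coefficient of variation is the standard deviation divided by the mean, the variance is the square of the standard deviation, and the range and IQR are differences of quantiles. A short sketch that checks the reported values against each other:

```python
# Values copied from the 층_번호 tables.
n_observations = 10_000
total = 38_178
std_dev = 5.6528222
minimum, maximum = 0, 39
q1, q3 = 0, 5

mean = total / n_observations
coefficient_of_variation = std_dev / mean
variance = std_dev ** 2
value_range = maximum - minimum
iqr = q3 - q1

print(f"Mean: {mean}")
print(f"CV: {coefficient_of_variation:.7f}")
print(f"Variance: {variance:.6f}")
print(f"Range: {value_range}, IQR: {iqr}")
```

Small differences in the last digit can occur because the standard deviation in the table is itself rounded.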

Histogram with fixed size bins (bins=40)

Value	Count	Frequency (%)
0	4372	43.7%
1	997	10.0%
2	734	7.3%
3	599	6.0%
4	428	4.3%
5	378	3.8%
7	286	2.9%
6	265	2.6%
9	210	2.1%
8	206	2.1%
Other values (30)	1525	15.2%

Minimum 10 values

Value	Count	Frequency (%)
0	4372	43.7%
1	997	10.0%
2	734	7.3%
3	599	6.0%
4	428	4.3%
5	378	3.8%
6	265	2.6%
7	286	2.9%
8	206	2.1%
9	210	2.1%

Maximum 10 values

Value	Count	Frequency (%)
39	3	< 0.1%
38	1	< 0.1%
37	2	< 0.1%
36	1	< 0.1%
35	3	< 0.1%
34	3	< 0.1%
33	3	< 0.1%
32	1	< 0.1%
31	4	< 0.1%
30	4	< 0.1%
작업_일자"
Categorical

CONSTANT

Distinct	1
Distinct (%)	< 0.1%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

20111227	10000

Length

Max length	8
Median length	8
Mean length	8
Min length	8

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	20111227
2nd row	20111227
3rd row	20111227
4th row	20111227
5th row	20111227

Common Values

Value	Count	Frequency (%)
20111227	10000	100.0%

Length

[Plot: histogram of lengths of the category]

Common Values (Plot)

[Plot: bar chart of the common values listed above]

Correlations

Phik (φk)

	층_구분_코드	층_번호
층_구분_코드	1.000	0.155
층_번호	0.155	1.000

Auto

	층_번호	층_구분_코드
층_번호	1.000	0.119
층_구분_코드	0.119	1.000

Missing values

A simple visualization of nullity by column.

Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.

Sample

First rows

	관리_폐쇄말소대장_PK	동명칭	호_명	층_구분_코드	층_번호	작업_일자"
39687	11620-12907	109동	802	20	8	20111227
11888	11470-13534	가동	303호	20	3	20111227
52433	11620-20385	<NA>	B01호	20	0	20111227
7989	11500-23424	5동	312	20	0	20111227
64074	11650-15591	223동	306호	20	0	20111227
55659	11650-21515	1동	303호	20	0	20111227
67089	11620-16219	102동	301호	20	0	20111227
19312	11440-22980	104동	1004호	20	10	20111227
11679	11500-100017854	상가1동	B126	10	1	20111227
49191	11410-10054	<NA>	202호	20	0	20111227

Last rows

	관리_폐쇄말소대장_PK	동명칭	호_명	층_구분_코드	층_번호	작업_일자"
48398	11440-14578	<NA>	101	20	1	20111227
36905	11500-19206	32동	2층205호	20	0	20111227
43185	11590-14121	104동	1103	20	11	20111227
16635	11470-7046	<NA>	202호	20	2	20111227
47123	11590-17375	5동	101호	20	0	20111227
40170	11620-13651	117동	2102	20	21	20111227
41650	11590-11755	107동	603호	20	6	20111227
63170	11650-12767	<NA>	301호	20	0	20111227
22308	11545-100009514	<NA>	2-1호	20	2	20111227
6168	11380-100021974	<NA>	3층2	20	1	20111227
