gimi9 Pandas Profiling

Dataset statistics

Number of variables	4
Number of observations	10000
Missing cells	3825
Missing cells (%)	9.6%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	390.6 KiB
Average record size in memory	40.0 B
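
The derived figures in this table can be checked from the raw counts. A minimal sketch, assuming the missing-cell percentage is missing cells divided by rows times columns, and the average record size is total memory divided by the number of observations (variable names are illustrative):

```python
# Figures copied from the Dataset statistics table.
n_variables = 4
n_observations = 10_000
missing_cells = 3_825
total_size_kib = 390.6

# Share of missing cells over all cells in the table.
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells

# Average bytes per row (1 KiB = 1024 bytes).
avg_record_bytes = total_size_kib * 1024 / n_observations

print(f"Missing cells: {missing_pct:.1f}%")
print(f"Average record size: {avg_record_bytes:.1f} B")
```

Both values round to the ones reported (9.6% and 40.0 B).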

Variable types

Text	3
Categorical	1

Dataset

Description	관리_건축물대장_PK,동명칭,호_명,층_구분_코드
Author	Seoul Metropolitan Government (서울특별시)
URL	https://data.seoul.go.kr/dataList/OA-15393/S/1/datasetView.do

Alerts

`층_구분_코드` is highly imbalanced (86.5%)	Imbalance
`동명칭` has 3816 (38.2%) missing values	Missing
`관리_건축물대장_PK` has unique values	Unique
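
The 86.5% imbalance figure is consistent with a score defined as one minus the normalized Shannon entropy of the class frequencies. The sketch below assumes that definition (it is inferred from the numbers, not stated in the report) and uses the class counts listed for `층_구분_코드`; the function name is hypothetical.

```python
import math

def imbalance_score(counts):
    """Return 1 - H / log(k), where H is the Shannon entropy of the
    class frequencies and k is the number of classes.
    0 means perfectly balanced, 1 means a single dominant class."""
    total = sum(counts)
    k = len(counts)
    if k < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return 1 - entropy / math.log(k)

# Class counts for 층_구분_코드: 지상, 지하, 옥탑.
score = imbalance_score([9661, 338, 1])
print(f"{score:.1%}")
```

A perfectly balanced column would score 0 under this definition, so a value near 0.865 indicates that one class dominates.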

Reproduction

Analysis started	2024-05-18 03:51:22.758348
Analysis finished	2024-05-18 03:51:24.850108
Duration	2.09 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json
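
The reported duration is the difference between the two timestamps. A small sketch using only the standard library:

```python
from datetime import datetime

# Timestamps copied from the Reproduction table.
started = datetime.fromisoformat("2024-05-18 03:51:22.758348")
finished = datetime.fromisoformat("2024-05-18 03:51:24.850108")

duration_s = (finished - started).total_seconds()
print(f"{duration_s:.2f} seconds")
```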

관리_건축물대장_PK
Text

UNIQUE

Distinct	10000
Distinct (%)	100.0%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

Length

Max length	28
Median length	11
Mean length	12.8371
Min length	11

Characters and Unicode

Total characters	128371
Distinct characters	11
Distinct categories	2
Distinct scripts	1
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
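
As an illustration of how such character properties can be read, the sketch below counts Unicode general categories with Python's standard `unicodedata` module, using the five sample values of this variable. It is not the profiler's own implementation. `unicodedata.category` returns two-letter abbreviations (`Nd` for Decimal Number, `Pd` for Dash Punctuation); scripts and blocks are not exposed by the standard library and are left out.

```python
import unicodedata
from collections import Counter

def category_counts(values):
    """Count the Unicode general category of every character in values."""
    counts = Counter()
    for value in values:
        for ch in value:
            counts[unicodedata.category(ch)] += 1
    return counts

# The five sample rows shown for 관리_건축물대장_PK.
sample = [
    "11320-20542",
    "11530-100243033",
    "11380-100182917",
    "11170-75974",
    "11350-92272",
]

counts = category_counts(sample)
for category, count in counts.most_common():
    print(category, count)
```

On this sample only two categories appear, matching the "Distinct categories" count for the full column.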

Unique

Unique	10000
Unique (%)	100.0%

Sample

1st row	11320-20542
2nd row	11530-100243033
3rd row	11380-100182917
4th row	11170-75974
5th row	11350-92272

Value	Count	Frequency (%)
11320-20542	1	< 0.1%
11410-91737	1	< 0.1%
11710-74309	1	< 0.1%
11440-49148	1	< 0.1%
11170-43779	1	< 0.1%
11710-74768	1	< 0.1%
11350-52486	1	< 0.1%
11380-111671	1	< 0.1%
11590-100201159	1	< 0.1%
11260-100270427	1	< 0.1%
Other values (9990)	9990	99.9%

Most occurring characters

Value	Count	Frequency (%)
1	31827	24.8%
0	25347	19.7%
-	10000	7.8%
5	9838	7.7%
3	9738	7.6%
2	9385	7.3%
4	6899	5.4%
8	6766	5.3%
6	6474	5.0%
7	6453	5.0%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	118371	92.2%
Dash Punctuation	10000	7.8%

Most frequent character per category

Decimal Number

Value	Count	Frequency (%)
1	31827	26.9%
0	25347	21.4%
5	9838	8.3%
3	9738	8.2%
2	9385	7.9%
4	6899	5.8%
8	6766	5.7%
6	6474	5.5%
7	6453	5.5%
9	5644	4.8%

Dash Punctuation

Value	Count	Frequency (%)
-	10000	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	128371	100.0%

Most frequent character per script

Common

Value	Count	Frequency (%)
1	31827	24.8%
0	25347	19.7%
-	10000	7.8%
5	9838	7.7%
3	9738	7.6%
2	9385	7.3%
4	6899	5.4%
8	6766	5.3%
6	6474	5.0%
7	6453	5.0%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	128371	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
1	31827	24.8%
0	25347	19.7%
-	10000	7.8%
5	9838	7.7%
3	9738	7.6%
2	9385	7.3%
4	6899	5.4%
8	6766	5.3%
6	6474	5.0%
7	6453	5.0%

동명칭
Text

MISSING

Distinct	778
Distinct (%)	12.6%
Missing	3816
Missing (%)	38.2%
Memory size	156.2 KiB

Length

Max length	26
Median length	4
Mean length	4.1235446
Min length	1

Characters and Unicode

Total characters	25500
Distinct characters	335
Distinct categories	10
Distinct scripts	3
Distinct blocks	3

Unique

Unique	379
Unique (%)	6.1%
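
The percentages and the mean length in this section appear to be computed over the non-missing values only, not over all 10000 rows. A quick check under that assumption, using the figures reported for `동명칭`:

```python
# Figures copied from the 동명칭 section.
n_rows = 10_000
n_missing = 3_816
n_present = n_rows - n_missing  # rows with a value

distinct_pct = 100 * 778 / n_present    # Distinct (%)
unique_pct = 100 * 379 / n_present      # Unique (%)
mean_length = 25_500 / n_present        # Total characters / non-missing rows

print(f"Distinct: {distinct_pct:.1f}%")
print(f"Unique: {unique_pct:.1f}%")
print(f"Mean length: {mean_length:.7f}")
```

Dividing by all 10000 rows instead would give 7.8% distinct and 3.8% unique, which does not match the report.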

Sample

1st row	111동
2nd row	804동
3rd row	811동
4th row	104동
5th row	204동

Value	Count	Frequency (%)
101동	505	7.8%
102동	376	5.8%
103동	237	3.7%
104동	223	3.5%
105동	217	3.4%
106동	216	3.4%
108동	115	1.8%
110동	92	1.4%
109동	90	1.4%
203동	84	1.3%
Other values (832)	4279	66.5%

Most occurring characters

Value	Count	Frequency (%)
동	5602	22.0%
1	5039	19.8%
0	3914	15.3%
2	1685	6.6%
3	1174	4.6%
4	943	3.7%
5	729	2.9%
6	684	2.7%
8	520	2.0%
7	367	1.4%
Other values (325)	4843	19.0%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	15361	60.2%
Other Letter	9329	36.6%
Uppercase Letter	407	1.6%
Space Separator	250	1.0%
Close Punctuation	43	0.2%
Open Punctuation	43	0.2%
Lowercase Letter	32	0.1%
Dash Punctuation	25	0.1%
Other Punctuation	8	< 0.1%
Letter Number	2	< 0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
동	5602	60.0%
가	219	2.3%
빌	166	1.8%
상	147	1.6%
스	129	1.4%
리	94	1.0%
아	90	1.0%
트	89	1.0%
이	88	0.9%
주	73	0.8%
Other values (280)	2632	28.2%

Uppercase Letter

Value	Count	Frequency (%)
A	76	18.7%
B	55	13.5%
T	50	12.3%
S	25	6.1%
E	25	6.1%
W	23	5.7%
V	20	4.9%
R	20	4.9%
O	18	4.4%
I	15	3.7%
Other values (12)	80	19.7%

Decimal Number

Value	Count	Frequency (%)
1	5039	32.8%
0	3914	25.5%
2	1685	11.0%
3	1174	7.6%
4	943	6.1%
5	729	4.7%
6	684	4.5%
8	520	3.4%
7	367	2.4%
9	306	2.0%

Lowercase Letter

Value	Count	Frequency (%)
l	14	43.8%
e	6	18.8%
z	6	18.8%
i	6	18.8%

Other Punctuation

Value	Count	Frequency (%)
.	6	75.0%
&	1	12.5%
,	1	12.5%

Letter Number

Value	Count	Frequency (%)
Ⅴ	1	50.0%
Ⅱ	1	50.0%

Space Separator

Value	Count	Frequency (%)
(space)	250	100.0%

Close Punctuation

Value	Count	Frequency (%)
)	43	100.0%

Open Punctuation

Value	Count	Frequency (%)
(	43	100.0%

Dash Punctuation

Value	Count	Frequency (%)
-	25	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	15730	61.7%
Hangul	9329	36.6%
Latin	441	1.7%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
동	5602	60.0%
가	219	2.3%
빌	166	1.8%
상	147	1.6%
스	129	1.4%
리	94	1.0%
아	90	1.0%
트	89	1.0%
이	88	0.9%
주	73	0.8%
Other values (280)	2632	28.2%

Latin

Value	Count	Frequency (%)
A	76	17.2%
B	55	12.5%
T	50	11.3%
S	25	5.7%
E	25	5.7%
W	23	5.2%
V	20	4.5%
R	20	4.5%
O	18	4.1%
I	15	3.4%
Other values (18)	114	25.9%

Common

Value	Count	Frequency (%)
1	5039	32.0%
0	3914	24.9%
2	1685	10.7%
3	1174	7.5%
4	943	6.0%
5	729	4.6%
6	684	4.3%
8	520	3.3%
7	367	2.3%
9	306	1.9%
Other values (7)	369	2.3%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	16169	63.4%
Hangul	9329	36.6%
Number Forms	2	< 0.1%

Most frequent character per block

Hangul

Value	Count	Frequency (%)
동	5602	60.0%
가	219	2.3%
빌	166	1.8%
상	147	1.6%
스	129	1.4%
리	94	1.0%
아	90	1.0%
트	89	1.0%
이	88	0.9%
주	73	0.8%
Other values (280)	2632	28.2%

ASCII

Value	Count	Frequency (%)
1	5039	31.2%
0	3914	24.2%
2	1685	10.4%
3	1174	7.3%
4	943	5.8%
5	729	4.5%
6	684	4.2%
8	520	3.2%
7	367	2.3%
9	306	1.9%
Other values (33)	808	5.0%

Number Forms

Value	Count	Frequency (%)
Ⅴ	1	50.0%
Ⅱ	1	50.0%

호_명
Text

Distinct	1822
Distinct (%)	18.2%
Missing	9
Missing (%)	0.1%
Memory size	156.2 KiB

Length

Max length	13
Median length	12
Mean length	3.9887899
Min length	1

Characters and Unicode

Total characters	39852
Distinct characters	80
Distinct categories	10
Distinct scripts	3
Distinct blocks	2

Unique

Unique	1156
Unique (%)	11.6%

Sample

1st row	605호
2nd row	1004
3rd row	708
4th row	102호
5th row	809호

Value	Count	Frequency (%)
301	211	2.1%
401	189	1.9%
201	187	1.9%
202	166	1.7%
302	159	1.6%
402	155	1.5%
501	148	1.5%
201호	136	1.4%
101	130	1.3%
301호	116	1.2%
Other values (1779)	8456	84.1%

Most occurring characters

Value	Count	Frequency (%)
0	9331	23.4%
1	7517	18.9%
호	4909	12.3%
2	4449	11.2%
3	3027	7.6%
4	2400	6.0%
5	1902	4.8%
6	1371	3.4%
7	1147	2.9%
8	868	2.2%
Other values (70)	2931	7.4%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	32842	82.4%
Other Letter	6192	15.5%
Uppercase Letter	381	1.0%
Dash Punctuation	324	0.8%
Space Separator	62	0.2%
Open Punctuation	18	< 0.1%
Close Punctuation	18	< 0.1%
Connector Punctuation	8	< 0.1%
Other Punctuation	6	< 0.1%
Lowercase Letter	1	< 0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
호	4909	79.3%
층	611	9.9%
지	188	3.0%
동	107	1.7%
하	53	0.9%
아	36	0.6%
오	32	0.5%
비	26	0.4%
상	24	0.4%
가	24	0.4%
Other values (36)	182	2.9%

Uppercase Letter

Value	Count	Frequency (%)
B	188	49.3%
A	80	21.0%
S	20	5.2%
E	19	5.0%
T	17	4.5%
W	12	3.1%
C	11	2.9%
F	10	2.6%
O	9	2.4%
G	4	1.0%
Other values (5)	11	2.9%

Decimal Number

Value	Count	Frequency (%)
0	9331	28.4%
1	7517	22.9%
2	4449	13.5%
3	3027	9.2%
4	2400	7.3%
5	1902	5.8%
6	1371	4.2%
7	1147	3.5%
8	868	2.6%
9	830	2.5%

Other Punctuation

Value	Count	Frequency (%)
.	4	66.7%
:	1	16.7%
,	1	16.7%

Dash Punctuation

Value	Count	Frequency (%)
-	324	100.0%

Space Separator

Value	Count	Frequency (%)
(space)	62	100.0%

Open Punctuation

Value	Count	Frequency (%)
(	18	100.0%

Close Punctuation

Value	Count	Frequency (%)
)	18	100.0%

Connector Punctuation

Value	Count	Frequency (%)
_	8	100.0%

Lowercase Letter

Value	Count	Frequency (%)
b	1	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	33278	83.5%
Hangul	6192	15.5%
Latin	382	1.0%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
호	4909	79.3%
층	611	9.9%
지	188	3.0%
동	107	1.7%
하	53	0.9%
아	36	0.6%
오	32	0.5%
비	26	0.4%
상	24	0.4%
가	24	0.4%
Other values (36)	182	2.9%

Common

Value	Count	Frequency (%)
0	9331	28.0%
1	7517	22.6%
2	4449	13.4%
3	3027	9.1%
4	2400	7.2%
5	1902	5.7%
6	1371	4.1%
7	1147	3.4%
8	868	2.6%
9	830	2.5%
Other values (8)	436	1.3%

Latin

Value	Count	Frequency (%)
B	188	49.2%
A	80	20.9%
S	20	5.2%
E	19	5.0%
T	17	4.5%
W	12	3.1%
C	11	2.9%
F	10	2.6%
O	9	2.4%
G	4	1.0%
Other values (6)	12	3.1%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	33660	84.5%
Hangul	6192	15.5%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
0	9331	27.7%
1	7517	22.3%
2	4449	13.2%
3	3027	9.0%
4	2400	7.1%
5	1902	5.7%
6	1371	4.1%
7	1147	3.4%
8	868	2.6%
9	830	2.5%
Other values (24)	818	2.4%

Hangul

Value	Count	Frequency (%)
호	4909	79.3%
층	611	9.9%
지	188	3.0%
동	107	1.7%
하	53	0.9%
아	36	0.6%
오	32	0.5%
비	26	0.4%
상	24	0.4%
가	24	0.4%
Other values (36)	182	2.9%

층_구분_코드
Categorical

IMBALANCE

Distinct	3
Distinct (%)	< 0.1%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

지상	9661
지하	338
옥탑	1

Length

Max length	2
Median length	2
Mean length	2
Min length	2

Unique

Unique	1
Unique (%)	< 0.1%

Sample

1st row	지상
2nd row	지상
3rd row	지상
4th row	지상
5th row	지상

Common Values

Value	Count	Frequency (%)
지상	9661	96.6%
지하	338	3.4%
옥탑	1	< 0.1%

Length

Histogram of lengths of the category

Common Values (Plot)

Value	Count	Frequency (%)
지상	9661	96.6%
지하	338	3.4%
옥탑	1	< 0.1%

Missing values

A simple visualization of nullity by column.

The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.

The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
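
Nullity correlation can be sketched as the Pearson correlation between the missingness indicators of two columns. The example below is a minimal illustration under that assumption; the function name and the toy columns are hypothetical and are not taken from the dataset.

```python
import math

def nullity_correlation(col_a, col_b):
    """Pearson correlation between the missing/present indicators of two
    equally long columns. Returns None when a column is always or never
    missing, since the correlation is undefined in that case."""
    a = [1.0 if v is None else 0.0 for v in col_a]
    b = [1.0 if v is None else 0.0 for v in col_b]
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None
    return cov / math.sqrt(var_a * var_b)

# Hypothetical toy columns; None marks a missing value.
building = ["101동", None, "102동", None]
unit = ["301", None, "401호", "202"]

r = nullity_correlation(building, unit)
print(round(r, 3))
```

A value near 1 means the two columns tend to be missing together, a value near -1 means one tends to be present when the other is missing, and a column with no missing values has no defined nullity correlation.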

Sample

First rows

	관리_건축물대장_PK	동명칭	호_명	층_구분_코드
32179	11320-20542	111동	605호	지상
73807	11530-100243033	804동	1004	지상
61660	11380-100182917	811동	708	지상
69431	11170-75974	<NA>	102호	지상
39957	11350-92272	104동	809호	지상
9880	11590-95572	204동	606	지상
42331	11590-100219793	에이동	202	지상
69308	11170-79517	(2단지)	202-2705	지상
54539	11350-91206	10동	507호	지상
39483	11440-58822	<NA>	아-502	지상

Last rows

	관리_건축물대장_PK	동명칭	호_명	층_구분_코드
88217	11380-100200182	332동	1002	지상
32551	11470-89216	<NA>	402호	지상
29091	11350-95910	203동	401호	지상
59883	11170-64086	<NA>	209호	지상
21739	11710-152434	상가	3층1호	지상
67417	11470-113238	106동	309호	지상
67990	11230-100256270	<NA>	604	지상
80807	11230-100181318	<NA>	501	지상
75080	11380-100199474	317동	605	지상
65976	11560-69081	<NA>	1층마-8호	지상
