gimi9 Pandas Profiling

Dataset statistics

Number of variables	6
Number of observations	5670
Missing cells	1
Missing cells (%)	< 0.1%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	282.5 KiB
Average record size in memory	51.0 B
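
The average record size can be cross-checked from the two figures above. A minimal sketch, assuming 1 KiB means 1024 bytes (the variable names are illustrative only):

```python
# Figures copied from the Dataset statistics table.
total_kib = 282.5   # Total size in memory
n_rows = 5670       # Number of observations

# Assumption: 1 KiB = 1024 bytes.
avg_record_bytes = total_kib * 1024 / n_rows
print(f"{avg_record_bytes:.1f} B")
```

The result rounds to the 51.0 B reported above.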

Variable types

Text	2
Categorical	4

Dataset

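Description	Specifications of the groundwater quality monitoring network. Provides the address (주소), station name (관측소명), well type (관정구분), groundwater use code (지하수용도코드), drinking-water use (음용여부), category (구분), and related fields. * For groundwater-related information, see www.gims.go.kr.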
URL	https://www.data.go.kr/data/15104449/fileData.do

Alerts

`관정구분` is highly overall correlated with `구분`	High correlation
`구분` is highly overall correlated with `관정구분`	High correlation
`관정구분` is highly imbalanced (66.5%)	Imbalance
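
The 66.5% imbalance figure is consistent with a score of one minus the normalized Shannon entropy of the class frequencies. The sketch below is an illustration under that assumption, not a statement of how the tool is implemented; `imbalance_score` is a hypothetical helper, and the counts are the ones listed for `관정구분` in its Common Values table.

```python
import math


def imbalance_score(counts):
    """Return 1 - H / log(k), with H the Shannon entropy of the class frequencies."""
    total = sum(counts)
    k = len(counts)
    if k < 2 or total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return 1.0 - entropy / math.log(k)


# Counts for the six 관정구분 values: <NA>, 1, 2, 5, 6, 4.
well_type_counts = [4764, 594, 199, 43, 35, 35]
score = imbalance_score(well_type_counts)
print(f"{score:.1%}")  # prints 66.5%
```

A score of 0 would mean all classes are equally frequent; a score near 1 means a single class dominates.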

Reproduction

Analysis started	2023-12-11 23:58:20.779827
Analysis finished	2023-12-11 23:58:21.531512
Duration	0.75 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json

주소
Text

Distinct	4714
Distinct (%)	83.1%
Missing	0
Missing (%)	0.0%
Memory size	44.4 KiB

Length

Max length	31
Median length	29
Mean length	20.150617
Min length	4

Characters and Unicode

Total characters	114254
Distinct characters	372
Distinct categories	7
Distinct scripts	4
Distinct blocks	3

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
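
As an illustration of the sentence above, Python's standard `unicodedata` module exposes the general category of each code point. This is a minimal sketch, not the report's own code; `category_counts` is a hypothetical helper, and the input is the third sample row of the `주소` column. The two-letter codes map to the names used in this report: `Lo` is Other Letter, `Nd` is Decimal Number, `Zs` is Space Separator, and `Pd` is Dash Punctuation.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    """Count characters by Unicode general category (e.g. 'Lo', 'Nd', 'Zs')."""
    return Counter(unicodedata.category(ch) for ch in text)


# Third sample row of the 주소 column.
sample = "강원도 홍천군 서면 모곡리 산234-4"
counts = category_counts(sample)
for code, n in counts.most_common():
    print(code, n)
```

Script and block lookups are not covered by `unicodedata`, so they are left out of this sketch.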

Unique

Unique	3928
Unique (%)	69.3%
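
The Distinct and Unique figures differ because they count different things. Assuming the usual definitions (Distinct is the number of different values; Unique is the number of values that occur exactly once), the distinction can be sketched on toy data. The helper name and the sample list are illustrative and not taken from the dataset.

```python
from collections import Counter


def distinct_and_unique(values):
    """Return (number of different values, number of values seen exactly once)."""
    counts = Counter(values)
    distinct = len(counts)
    unique = sum(1 for n in counts.values() if n == 1)
    return distinct, unique


# Toy data: "A" and "D" occur once, "B" twice, "C" three times.
addresses = ["A", "B", "B", "C", "C", "C", "D"]
distinct, unique = distinct_and_unique(addresses)
print(distinct, unique)  # prints 4 2
```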

Sample

1st row	전라북도 임실군 덕치면 장암리 산301
2nd row	충청남도 예산군 예산읍 주교리 420
3rd row	강원도 홍천군 서면 모곡리 산234-4
4th row	충청북도 충주시 중앙탑면 가흥리 582
5th row	충청북도 충주시 동량면 조동리 1370-4

Value	Count	Frequency (%)
경기도	805	3.1%
경상북도	619	2.4%
경상남도	532	2.1%
전라남도	501	2.0%
강원도	458	1.8%
전라북도	413	1.6%
충청남도	404	1.6%
대구광역시	401	1.6%
충청북도	355	1.4%
서울특별시	304	1.2%
Other values (6734)	20858	81.3%

Most occurring characters

Value	Count	Frequency (%)
(space)	22910	20.1%
도	4395	3.8%
1	4217	3.7%
시	3993	3.5%
-	3513	3.1%
동	3381	3.0%
리	3062	2.7%
2	2786	2.4%
구	2395	2.1%
3	2242	2.0%
Other values (362)	61360	53.7%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	67205	58.8%
Space Separator	22910	20.1%
Decimal Number	20604	18.0%
Dash Punctuation	3513	3.1%
Uppercase Letter	10	< 0.1%
Open Punctuation	6	< 0.1%
Close Punctuation	6	< 0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
도	4395	6.5%
시	3993	5.9%
동	3381	5.0%
리	3062	4.6%
구	2395	3.6%
경	2087	3.1%
남	2078	3.1%
면	2040	3.0%
군	2035	3.0%
산	1970	2.9%
Other values (342)	39769	59.2%

Decimal Number

Value	Count	Frequency (%)
1	4217	20.5%
2	2786	13.5%
3	2242	10.9%
4	1991	9.7%
5	1917	9.3%
6	1656	8.0%
7	1515	7.4%
0	1447	7.0%
8	1434	7.0%
9	1399	6.8%

Uppercase Letter

Value	Count	Frequency (%)
B	3	30.0%
N	3	30.0%
L	1	10.0%
T	1	10.0%
P	1	10.0%
A	1	10.0%

Space Separator

Value	Count	Frequency (%)
(space)	22910	100.0%

Dash Punctuation

Value	Count	Frequency (%)
-	3513	100.0%

Open Punctuation

Value	Count	Frequency (%)
(	6	100.0%

Close Punctuation

Value	Count	Frequency (%)
)	6	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Hangul	67201	58.8%
Common	47039	41.2%
Latin	10	< 0.1%
Han	4	< 0.1%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
도	4395	6.5%
시	3993	5.9%
동	3381	5.0%
리	3062	4.6%
구	2395	3.6%
경	2087	3.1%
남	2078	3.1%
면	2040	3.0%
군	2035	3.0%
산	1970	2.9%
Other values (340)	39765	59.2%

Common

Value	Count	Frequency (%)
(space)	22910	48.7%
1	4217	9.0%
-	3513	7.5%
2	2786	5.9%
3	2242	4.8%
4	1991	4.2%
5	1917	4.1%
6	1656	3.5%
7	1515	3.2%
0	1447	3.1%
Other values (4)	2845	6.0%

Latin

Value	Count	Frequency (%)
B	3	30.0%
N	3	30.0%
L	1	10.0%
T	1	10.0%
P	1	10.0%
A	1	10.0%

Han

Value	Count	Frequency (%)
華	2	50.0%
山	2	50.0%

Most occurring blocks

Value	Count	Frequency (%)
Hangul	67201	58.8%
ASCII	47049	41.2%
CJK	4	< 0.1%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
(space)	22910	48.7%
1	4217	9.0%
-	3513	7.5%
2	2786	5.9%
3	2242	4.8%
4	1991	4.2%
5	1917	4.1%
6	1656	3.5%
7	1515	3.2%
0	1447	3.1%
Other values (10)	2855	6.1%

Hangul

Value	Count	Frequency (%)
도	4395	6.5%
시	3993	5.9%
동	3381	5.0%
리	3062	4.6%
구	2395	3.6%
경	2087	3.1%
남	2078	3.1%
면	2040	3.0%
군	2035	3.0%
산	1970	2.9%
Other values (340)	39765	59.2%

CJK

Value	Count	Frequency (%)
華	2	50.0%
山	2	50.0%

관측소명
Text

Distinct	2916
Distinct (%)	51.4%
Missing	1
Missing (%)	< 0.1%
Memory size	44.4 KiB

Length

Max length	13
Median length	4
Mean length	4.1718116
Min length	2

Characters and Unicode

Total characters	23650
Distinct characters	384
Distinct categories	9
Distinct scripts	3
Distinct blocks	3

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	1610
Unique (%)	28.4%

Sample

1st row	임실덕치_신
2nd row	예산예산
3rd row	홍천서면
4th row	충주가금
5th row	충주동량

Value	Count	Frequency (%)
자동관측정	72	1.3%
남구대명	15	0.3%
달성논공	13	0.2%
울산서하	12	0.2%
평창평창	12	0.2%
순창순창	11	0.2%
옥천옥천	11	0.2%
서구평리	11	0.2%
신안지도	11	0.2%
여주점동	11	0.2%
Other values (2928)	5527	96.9%

Most occurring characters

Value	Count	Frequency (%)
산	944	4.0%
천	840	3.6%
주	755	3.2%
구	710	3.0%
성	709	3.0%
동	675	2.9%
양	529	2.2%
안	520	2.2%
남	470	2.0%
서	389	1.6%
Other values (374)	17109	72.3%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	22580	95.5%
Decimal Number	970	4.1%
Space Separator	40	0.2%
Connector Punctuation	25	0.1%
Dash Punctuation	21	0.1%
Uppercase Letter	7	< 0.1%
Other Symbol	3	< 0.1%
Open Punctuation	2	< 0.1%
Close Punctuation	2	< 0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
산	944	4.2%
천	840	3.7%
주	755	3.3%
구	710	3.1%
성	709	3.1%
동	675	3.0%
양	529	2.3%
안	520	2.3%
남	470	2.1%
서	389	1.7%
Other values (355)	16039	71.0%

Decimal Number

Value	Count	Frequency (%)
1	328	33.8%
2	267	27.5%
3	248	25.6%
4	62	6.4%
5	27	2.8%
6	19	2.0%
8	12	1.2%
7	4	0.4%
9	3	0.3%

Uppercase Letter

Value	Count	Frequency (%)
S	2	28.6%
K	2	28.6%
C	2	28.6%
A	1	14.3%

Space Separator

Value	Count	Frequency (%)
(space)	40	100.0%

Connector Punctuation

Value	Count	Frequency (%)
_	25	100.0%

Dash Punctuation

Value	Count	Frequency (%)
-	21	100.0%

Other Symbol

Value	Count	Frequency (%)
㈜	3	100.0%

Open Punctuation

Value	Count	Frequency (%)
(	2	100.0%

Close Punctuation

Value	Count	Frequency (%)
)	2	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Hangul	22583	95.5%
Common	1060	4.5%
Latin	7	< 0.1%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
산	944	4.2%
천	840	3.7%
주	755	3.3%
구	710	3.1%
성	709	3.1%
동	675	3.0%
양	529	2.3%
안	520	2.3%
남	470	2.1%
서	389	1.7%
Other values (356)	16042	71.0%

Common

Value	Count	Frequency (%)
1	328	30.9%
2	267	25.2%
3	248	23.4%
4	62	5.8%
(space)	40	3.8%
5	27	2.5%
_	25	2.4%
-	21	2.0%
6	19	1.8%
8	12	1.1%
Other values (4)	11	1.0%

Latin

Value	Count	Frequency (%)
S	2	28.6%
K	2	28.6%
C	2	28.6%
A	1	14.3%

Most occurring blocks

Value	Count	Frequency (%)
Hangul	22580	95.5%
ASCII	1067	4.5%
None	3	< 0.1%

Most frequent character per block

Hangul

Value	Count	Frequency (%)
산	944	4.2%
천	840	3.7%
주	755	3.3%
구	710	3.1%
성	709	3.1%
동	675	3.0%
양	529	2.3%
안	520	2.3%
남	470	2.1%
서	389	1.7%
Other values (355)	16039	71.0%

ASCII

Value	Count	Frequency (%)
1	328	30.7%
2	267	25.0%
3	248	23.2%
4	62	5.8%
(space)	40	3.7%
5	27	2.5%
_	25	2.3%
-	21	2.0%
6	19	1.8%
8	12	1.1%
Other values (8)	18	1.7%

None

Value	Count	Frequency (%)
㈜	3	100.0%

관정구분
Categorical

HIGH CORRELATION, IMBALANCE

Distinct	6
Distinct (%)	0.1%
Missing	0
Missing (%)	0.0%
Memory size	44.4 KiB

<NA>	4764
1	594
2	199
5	43
6	35

Length

Max length	4
Median length	4
Mean length	3.5206349
Min length	1
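
The length statistics suggest that the missing marker in this column is stored as the literal four-character string `<NA>` rather than as a true null, which would also explain why Missing is reported as 0. A quick check under that assumption, using the counts listed for this variable:

```python
# Counts for the six 관정구분 values, with "<NA>" treated as an ordinary string.
value_counts = {"<NA>": 4764, "1": 594, "2": 199, "5": 43, "6": 35, "4": 35}

total = sum(value_counts.values())
mean_length = sum(len(value) * n for value, n in value_counts.items()) / total
print(total, round(mean_length, 7))
```

The total matches the 5670 observations, and the mean length agrees with the 3.5206349 reported above.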

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	2
2nd row	1
3rd row	1
4th row	1
5th row	1

Common Values

Value	Count	Frequency (%)
<NA>	4764	84.0%
1	594	10.5%
2	199	3.5%
5	43	0.8%
6	35	0.6%
4	35	0.6%

Length: histogram of category lengths (plot not reproduced).

Common Values (Plot): plot not reproduced; it shows the same counts as the Common Values table above.

지하수용도코드
Categorical

Distinct	5
Distinct (%)	0.1%
Missing	0
Missing (%)	0.0%
Memory size	44.4 KiB

1	4246
<NA>	612
2	486
3	322
4	4

Length

Max length	4
Median length	1
Mean length	1.3238095
Min length	1

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	1
2nd row	1
3rd row	1
4th row	1
5th row	1

Common Values

Value	Count	Frequency (%)
1	4246	74.9%
<NA>	612	10.8%
2	486	8.6%
3	322	5.7%
4	4	0.1%

Length: histogram of category lengths (plot not reproduced).

Common Values (Plot): plot not reproduced; it shows the same counts as the Common Values table above.

음용여부
Categorical

Distinct	3
Distinct (%)	0.1%
Missing	0
Missing (%)	0.0%
Memory size	44.4 KiB

0	2777
1	1666
<NA>	1227

Length

Max length	4
Median length	1
Mean length	1.6492063
Min length	1

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	<NA>
2nd row	<NA>
3rd row	<NA>
4th row	<NA>
5th row	<NA>

Common Values

Value	Count	Frequency (%)
0	2777	49.0%
1	1666	29.4%
<NA>	1227	21.6%

Length: histogram of category lengths (plot not reproduced).

Common Values (Plot): plot not reproduced; it shows the same counts as the Common Values table above.

구분
Categorical

HIGH CORRELATION

Distinct	3
Distinct (%)	0.1%
Missing	0
Missing (%)	0.0%
Memory size	44.4 KiB

일반지역	2846
오염우려지역	2146
국가관측망	678

Length

Max length	6
Median length	4
Mean length	4.8765432
Min length	4

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	국가관측망
2nd row	국가관측망
3rd row	국가관측망
4th row	국가관측망
5th row	국가관측망

Common Values

Value	Count	Frequency (%)
일반지역	2846	50.2%
오염우려지역	2146	37.8%
국가관측망	678	12.0%

Length: histogram of category lengths (plot not reproduced).

Common Values (Plot): plot not reproduced; it shows the same counts as the Common Values table above.

Correlations

	관정구분	지하수용도코드	음용여부	구분
관정구분	1.000	0.000	0.121	0.561
지하수용도코드	0.000	1.000	0.471	0.295
음용여부	0.121	0.471	1.000	0.180
구분	0.561	0.295	0.180	1.000


	구분	관정구분	지하수용도코드	음용여부
구분	1.000	0.505	0.284	0.297
관정구분	0.505	1.000	0.000	0.077
지하수용도코드	0.284	0.000	1.000	0.318
음용여부	0.297	0.077	0.318	1.000


	관정구분	지하수용도코드	음용여부	구분
관정구분	1.000	0.000	0.077	0.505
지하수용도코드	0.000	1.000	0.318	0.284
음용여부	0.077	0.318	1.000	0.297
구분	0.505	0.284	0.297	1.000
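
The report does not state which coefficient each matrix uses. For pairs of categorical variables, Cramér's V is a common choice, so the sketch below shows how such a value could be computed. It is an assumption for illustration, not a reconstruction of the numbers above; `cramers_v` is a hypothetical helper without bias correction, and the labels are toy data.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V for two equal-length sequences of categorical labels."""
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    k = min(len(x_counts), len(y_counts))
    if n == 0 or k < 2:
        return 0.0
    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


# Toy labels, not taken from the dataset.
perfect = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
print(perfect, independent)  # prints 1.0 0.0
```

Values range from 0 (no association) to 1 (one variable fully determines the other).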

Missing values

A simple visualization of nullity by column.

The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
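
A per-column nullity count can be sketched in a few lines. The rows below are three records from the tail of this dataset (indices 5661, 5665, and 5668). Treating the empty string and the literal `<NA>` as missing is an assumption made for this example; `nullity_by_column` and `MISSING_MARKERS` are hypothetical names.

```python
# Assumption: these markers all denote a missing cell.
MISSING_MARKERS = {None, "", "<NA>"}

columns = ["주소", "관측소명", "관정구분", "지하수용도코드", "음용여부", "구분"]
rows = [
    ("", "태창광산", "<NA>", "1", "0", "오염우려지역"),
    ("전라남도 완도군 완도읍 가용리 172", "완도가용", "5", "<NA>", "<NA>", "오염우려지역"),
    ("경기도 안성시 미양면 계륵리 268", "안성미양", "<NA>", "2", "0", "오염우려지역"),
]


def nullity_by_column(columns, rows, markers=MISSING_MARKERS):
    """Return {column name: number of rows whose cell is a missing marker}."""
    return {
        col: sum(1 for row in rows if row[i] in markers)
        for i, col in enumerate(columns)
    }


nullity = nullity_by_column(columns, rows)
for col, n_missing in nullity.items():
    print(col, n_missing)
```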

Sample

First rows

	주소	관측소명	관정구분	지하수용도코드	음용여부	구분
0	전라북도 임실군 덕치면 장암리 산301	임실덕치_신	2	1	<NA>	국가관측망
1	충청남도 예산군 예산읍 주교리 420	예산예산	1	1	<NA>	국가관측망
2	강원도 홍천군 서면 모곡리 산234-4	홍천서면	1	1	<NA>	국가관측망
3	충청북도 충주시 중앙탑면 가흥리 582	충주가금	1	1	<NA>	국가관측망
4	충청북도 충주시 동량면 조동리 1370-4	충주동량	1	1	<NA>	국가관측망
5	강원도 춘천시 북산면 추곡리 108-1	춘천북산	1	1	<NA>	국가관측망
6	경상남도 창녕군 영산면 죽사리 1456-41	창녕영산	1	1	<NA>	국가관측망
7	전라북도 남원시 도통동 554	남원도통	1	1	<NA>	국가관측망
8	충청북도 옥천군 청성면 묘금리 19-1	옥천청성	1	1	<NA>	국가관측망
9	경상북도 경주시 외동읍 활성리 948-1	경주외동	2	1	<NA>	국가관측망

Last rows

	주소	관측소명	관정구분	지하수용도코드	음용여부	구분
5660	경상북도 상주시 공검면 양정리 898	상주2	1	<NA>	0	오염우려지역
5661		태창광산	<NA>	1	0	오염우려지역
5662	경상남도 함양군 병곡면 송평리 628-3	함양병곡	2	1	1	국가관측망
5663	경상남도 창원시 성산구 반림동 6-4	창원반림	<NA>	1	1	오염우려지역
5664	경상남도 창원시 반림동 6-4	창원반림	<NA>	1	1	오염우려지역
5665	전라남도 완도군 완도읍 가용리 172	완도가용	5	<NA>	<NA>	오염우려지역
5666	울산광역시 울주군 두서면 서하리 86	울산서하	6	<NA>	<NA>	오염우려지역
5667	경상북도 청도군 청도읍 신도리 50-2	청도신도	4	<NA>	<NA>	오염우려지역
5668	경기도 안성시 미양면 계륵리 268	안성미양	<NA>	2	0	오염우려지역
5669	경기도 안성시 신건지동 60-1	안성신건	<NA>	2	0	오염우려지역
