gimi9 Pandas Profiling

Dataset statistics

Number of variables	6
Number of observations	10000
Missing cells	19837
Missing cells (%)	33.1%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	556.6 KiB
Average record size in memory	57.0 B
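
The missing-cell and record-size figures above can be checked against the per-variable counts reported later in this profile. The following is a minimal sketch, assuming the denominator is simply variables × observations; the variable names are illustrative and not part of the report.

```python
# Per-variable missing counts as listed in the variable sections of this profile.
missing_per_variable = {
    "연번": 0,
    "KEYWORD": 0,
    "WORD": 9890,
    "TYPE": 0,
    "긍부정구분": 0,
    "단어": 9947,
}
n_observations = 10_000

total_cells = len(missing_per_variable) * n_observations
missing_cells = sum(missing_per_variable.values())
missing_pct = 100 * missing_cells / total_cells

# Total size in memory is reported in KiB (1 KiB = 1024 bytes).
avg_record_bytes = 556.6 * 1024 / n_observations

print(missing_cells, f"{missing_pct:.1f}%", f"{avg_record_bytes:.1f} B")
# prints: 19837 33.1% 57.0 B
```

TYPE and 긍부정구분 add nothing to the total, apparently because the report treats their `<NA>` entries as a category value rather than as missing cells.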

Variable types

Numeric	1
Text	3
Categorical	2

Dataset

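Description	A list of keywords (positive, negative, and so on) used for civil-complaint statistics analysis in Changwon City's big data system. The fields are the serial number (연번), the keyword, and the category (stopword, positive).
Author	Changwon City, Gyeongsangnam-do (경상남도 창원시)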
URL	https://bigdata.gyeongnam.go.kr/index.gn?menuCd=DOM_000000114002001000&publicdatapk=15063986

Alerts

`연번` is highly overall correlated with `TYPE` and 1 other field	High correlation
`TYPE` is highly overall correlated with `연번`	High correlation
`긍부정구분` is highly overall correlated with `연번`	High correlation
`TYPE` is highly imbalanced (94.8%)	Imbalance
`긍부정구분` is highly imbalanced (96.7%)	Imbalance
`WORD` has 9890 (98.9%) missing values	Missing
`단어` has 9947 (99.5%) missing values	Missing
`연번` has unique values	Unique
`KEYWORD` has unique values	Unique
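
The two imbalance percentages can be reproduced from the category counts listed under TYPE and 긍부정구분. The sketch below assumes the score is one minus the Shannon entropy normalised by its maximum, log(k) for k categories. That formula is an assumption checked only against the numbers in this report, and `imbalance_score` is a hypothetical helper name.

```python
import math


def imbalance_score(counts):
    """Return 1 - H / log(k): 0 for a uniform split, near 1 when one class dominates."""
    k = len(counts)
    if k < 2:
        return 0.0  # convention chosen for this sketch only
    total = sum(counts)
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return 1 - entropy / math.log(k)


type_counts = [9890, 55, 30, 25]   # <NA>, 불용어, 부정, 긍정
polarity_counts = [9947, 29, 24]   # <NA>, 부정, 긍정

type_score = imbalance_score(type_counts)
polarity_score = imbalance_score(polarity_counts)
print(f"{type_score:.1%}", f"{polarity_score:.1%}")
# prints: 94.8% 96.7%
```

Both alerts are driven almost entirely by the `<NA>` category, which accounts for roughly 99% of the rows in each column.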

Reproduction

Analysis started	2023-12-10 23:22:38.702028
Analysis finished	2023-12-10 23:22:39.815826
Duration	1.11 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json
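
A report of this kind is typically generated with a few lines of ydata-profiling. The sketch below is not the configuration used for this run: the file paths, the title, and the `build_report` helper are placeholders, and it assumes the data is available as a CSV file.

```python
def build_report(csv_path: str, html_path: str) -> None:
    """Profile a CSV file and write an HTML report (both paths are placeholders)."""
    # Imported lazily so this module still loads where the libraries are absent.
    import pandas as pd
    from ydata_profiling import ProfileReport

    df = pd.read_csv(csv_path)
    report = ProfileReport(df, title="Pandas Profiling")
    report.to_file(html_path)
```

Calling `build_report("keywords.csv", "report.html")` with a real input path would write the report next to the script.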

연번
Real number (ℝ)

HIGH CORRELATION UNIQUE

Distinct	10000
Distinct (%)	100.0%
Missing	0
Missing (%)	0.0%
Infinite	0
Infinite (%)	0.0%
Mean	431124.37

Minimum	19
Maximum	1078995
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	166.0 KiB

Quantile statistics

Minimum	19
5-th percentile	13663.05
Q1	119448.75
median	374967
Q3	709924.25
95-th percentile	972240.5
Maximum	1078995
Range	1078976
Interquartile range (IQR)	590475.5

Descriptive statistics

Standard deviation	327707.04
Coefficient of variation (CV)	0.76012182
Kurtosis	-1.1976801
Mean	431124.37
Median Absolute Deviation (MAD)	283603.5
Skewness	0.33501269
Sum	4.3112437 × 10⁹
Variance	1.0739191 × 10¹¹
Monotonicity	Not monotonic
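
Several of the figures above are derived from others in the same tables, which makes them easy to cross-check. The sketch below recomputes them from the rounded values printed in this report, so small rounding differences are expected; the variable names are illustrative.

```python
import math

n = 10_000
minimum, maximum = 19, 1_078_995
q1, q3 = 119_448.75, 709_924.25
mean, std = 431_124.37, 327_707.04

value_range = maximum - minimum  # Range
iqr = q3 - q1                    # Interquartile range
cv = std / mean                  # Coefficient of variation
variance = std ** 2              # Variance
total = mean * n                 # Sum

# Compare against the reported values, allowing for rounding of the inputs.
assert value_range == 1_078_976
assert iqr == 590_475.5
assert math.isclose(cv, 0.76012182, rel_tol=1e-6)
assert math.isclose(variance, 1.0739191e11, rel_tol=1e-6)
assert math.isclose(total, 4.3112437e9, rel_tol=1e-6)
```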

Histogram with fixed size bins (bins=50)

Common Values

Value	Count	Frequency (%)
269389	1	< 0.1%
382596	1	< 0.1%
87018	1	< 0.1%
43619	1	< 0.1%
11452	1	< 0.1%
193262	1	< 0.1%
537998	1	< 0.1%
215896	1	< 0.1%
42454	1	< 0.1%
724469	1	< 0.1%
Other values (9990)	9990	99.9%

Minimum 10 values

Value	Count	Frequency (%)
19	1	< 0.1%
20	1	< 0.1%
50	1	< 0.1%
59	1	< 0.1%
60	1	< 0.1%
69	1	< 0.1%
75	1	< 0.1%
88	1	< 0.1%
92	1	< 0.1%
95	1	< 0.1%

Maximum 10 values

Value	Count	Frequency (%)
1078995	1	< 0.1%
1078983	1	< 0.1%
1078897	1	< 0.1%
1078361	1	< 0.1%
1078360	1	< 0.1%
1078337	1	< 0.1%
1078304	1	< 0.1%
1078262	1	< 0.1%
1078145	1	< 0.1%
1078140	1	< 0.1%

KEYWORD
Text

UNIQUE

Distinct	10000
Distinct (%)	100.0%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

Length

Max length	33
Median length	18
Mean length	4.5212
Min length	2

Characters and Unicode

Total characters	45212
Distinct characters	1170
Distinct categories	2
Distinct scripts	2
Distinct blocks	2

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
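
The mean length of each text variable follows from its total character count divided by its number of non-missing rows. The sketch below checks that relationship for the three text columns in this profile, using the counts printed in their respective sections; the dictionary layout is illustrative.

```python
# (total characters, non-missing rows) for each text variable in this profile.
text_columns = {
    "KEYWORD": (45_212, 10_000),
    "WORD": (275, 110),
    "단어": (111, 53),
}

mean_length = {
    name: chars / rows for name, (chars, rows) in text_columns.items()
}

for name, value in mean_length.items():
    print(name, round(value, 7))
```

The results agree with the reported mean lengths (4.5212, 2.5 and 2.0943396). The median length of 18 reported for KEYWORD cannot be reconciled with a mean of 4.5212 and a minimum of 2, so that figure is probably computed on a different basis and should be treated with caution.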

Unique

Unique	10000
Unique (%)	100.0%

Sample

1st row	인력부족
2nd row	회차로변
3rd row	좋치않
4th row	토석채취허
5th row	개판오분직전

Value	Count	Frequency (%)
잘	2	< 0.1%
못	2	< 0.1%
아무	2	< 0.1%
혼자사	1	< 0.1%
도로소통	1	< 0.1%
창워시	1	< 0.1%
설치를해야된다	1	< 0.1%
뒷길	1	< 0.1%
동선리	1	< 0.1%
양덕동메트로시티	1	< 0.1%
Other values (9993)	9993	99.9%

Most occurring characters

Value	Count	Frequency (%)
다	949	2.1%
하	841	1.9%
지	781	1.7%
이	729	1.6%
는	682	1.5%
니	560	1.2%
고	532	1.2%
시	530	1.2%
가	488	1.1%
주	475	1.1%
Other values (1160)	38645	85.5%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	45206	> 99.9%
Space Separator	6	< 0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
다	949	2.1%
하	841	1.9%
지	781	1.7%
이	729	1.6%
는	682	1.5%
니	560	1.2%
고	532	1.2%
시	530	1.2%
가	488	1.1%
주	475	1.1%
Other values (1159)	38639	85.5%

Space Separator

Value	Count	Frequency (%)
(space, U+0020)	6	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Hangul	45206	> 99.9%
Common	6	< 0.1%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
다	949	2.1%
하	841	1.9%
지	781	1.7%
이	729	1.6%
는	682	1.5%
니	560	1.2%
고	532	1.2%
시	530	1.2%
가	488	1.1%
주	475	1.1%
Other values (1159)	38639	85.5%

Common

Value	Count	Frequency (%)
(space, U+0020)	6	100.0%

Most occurring blocks

Value	Count	Frequency (%)
Hangul	45206	> 99.9%
ASCII	6	< 0.1%

Most frequent character per block

Hangul

Value	Count	Frequency (%)
다	949	2.1%
하	841	1.9%
지	781	1.7%
이	729	1.6%
는	682	1.5%
니	560	1.2%
고	532	1.2%
시	530	1.2%
가	488	1.1%
주	475	1.1%
Other values (1159)	38639	85.5%

ASCII

Value	Count	Frequency (%)
(space, U+0020)	6	100.0%

WORD
Text

MISSING

Distinct	110
Distinct (%)	100.0%
Missing	9890
Missing (%)	98.9%
Memory size	156.2 KiB

Length

Max length	8
Median length	2
Mean length	2.5
Min length	2

Characters and Unicode

Total characters	275
Distinct characters	186
Distinct categories	3
Distinct scripts	2
Distinct blocks	2

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	110
Unique (%)	100.0%

Sample

1st row	QKF
2nd row	시도새올
3rd row	고객만족도
4th row	어물
5th row	아니다

Value	Count	Frequency (%)
이런건	1	0.9%
호기심	1	0.9%
정직	1	0.9%
위반	1	0.9%
창의적	1	0.9%
업체	1	0.9%
때문	1	0.9%
답변	1	0.9%
공개	1	0.9%
판사	1	0.9%
Other values (100)	100	90.9%

Most occurring characters

Value	Count	Frequency (%)
이	5	1.8%
하	5	1.8%
심	5	1.8%
니	4	1.5%
사	4	1.5%
기	4	1.5%
의	4	1.5%
정	4	1.5%
도	4	1.5%
나	3	1.1%
Other values (176)	233	84.7%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	269	97.8%
Lowercase Letter	3	1.1%
Uppercase Letter	3	1.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
이	5	1.9%
하	5	1.9%
심	5	1.9%
니	4	1.5%
사	4	1.5%
기	4	1.5%
의	4	1.5%
정	4	1.5%
도	4	1.5%
나	3	1.1%
Other values (170)	227	84.4%

Lowercase Letter

Value	Count	Frequency (%)
o	1	33.3%
m	1	33.3%
c	1	33.3%

Uppercase Letter

Value	Count	Frequency (%)
Q	1	33.3%
K	1	33.3%
F	1	33.3%

Most occurring scripts

Value	Count	Frequency (%)
Hangul	269	97.8%
Latin	6	2.2%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
이	5	1.9%
하	5	1.9%
심	5	1.9%
니	4	1.5%
사	4	1.5%
기	4	1.5%
의	4	1.5%
정	4	1.5%
도	4	1.5%
나	3	1.1%
Other values (170)	227	84.4%

Latin

Value	Count	Frequency (%)
o	1	16.7%
m	1	16.7%
c	1	16.7%
Q	1	16.7%
K	1	16.7%
F	1	16.7%

Most occurring blocks

Value	Count	Frequency (%)
Hangul	269	97.8%
ASCII	6	2.2%

Most frequent character per block

Hangul

Value	Count	Frequency (%)
이	5	1.9%
하	5	1.9%
심	5	1.9%
니	4	1.5%
사	4	1.5%
기	4	1.5%
의	4	1.5%
정	4	1.5%
도	4	1.5%
나	3	1.1%
Other values (170)	227	84.4%

ASCII

Value	Count	Frequency (%)
o	1	16.7%
m	1	16.7%
c	1	16.7%
Q	1	16.7%
K	1	16.7%
F	1	16.7%

TYPE
Categorical

HIGH CORRELATION IMBALANCE

Distinct	4
Distinct (%)	< 0.1%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

<NA>	9890
불용어	55
부정	30
긍정	25

Length

Max length	4
Median length	4
Mean length	3.9835
Min length	2

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	<NA>
2nd row	<NA>
3rd row	<NA>
4th row	<NA>
5th row	<NA>

Common Values

Value	Count	Frequency (%)
<NA>	9890	98.9%
불용어	55	0.5%
부정	30	0.3%
긍정	25	0.2%

Length

Histogram of lengths of the category

Common Values (Plot)

Value	Count	Frequency (%)
<NA>	9890	98.9%
불용어	55	0.5%
부정	30	0.3%
긍정	25	0.2%

긍부정구분
Categorical

HIGH CORRELATION IMBALANCE

Distinct	3
Distinct (%)	< 0.1%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

<NA>	9947
부정	29
긍정	24

Length

Max length	4
Median length	4
Mean length	3.9894
Min length	2

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	<NA>
2nd row	<NA>
3rd row	<NA>
4th row	<NA>
5th row	<NA>

Common Values

Value	Count	Frequency (%)
<NA>	9947	99.5%
부정	29	0.3%
긍정	24	0.2%

Length

Histogram of lengths of the category

Common Values (Plot)

Value	Count	Frequency (%)
<NA>	9947	99.5%
부정	29	0.3%
긍정	24	0.2%

단어
Text

MISSING

Distinct	53
Distinct (%)	100.0%
Missing	9947
Missing (%)	99.5%
Memory size	156.2 KiB

Length

Max length	3
Median length	2
Mean length	2.0943396
Min length	2

Characters and Unicode

Total characters	111
Distinct characters	88
Distinct categories	1
Distinct scripts	1
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	53
Unique (%)	100.0%

Sample

1st row	시비
2nd row	즐거움
3rd row	거짓말
4th row	불편함
5th row	곤혹

Value	Count	Frequency (%)
대단	1	1.9%
안심	1	1.9%
진정	1	1.9%
다행	1	1.9%
활약	1	1.9%
지적	1	1.9%
합리	1	1.9%
해결	1	1.9%
온화	1	1.9%
지루	1	1.9%
Other values (43)	43	81.1%

Most occurring characters

Value	Count	Frequency (%)
비	3	2.7%
신	3	2.7%
혹	3	2.7%
우	3	2.7%
적	2	1.8%
덜	2	1.8%
단	2	1.8%
악	2	1.8%
진	2	1.8%
지	2	1.8%
Other values (78)	87	78.4%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	111	100.0%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
비	3	2.7%
신	3	2.7%
혹	3	2.7%
우	3	2.7%
적	2	1.8%
덜	2	1.8%
단	2	1.8%
악	2	1.8%
진	2	1.8%
지	2	1.8%
Other values (78)	87	78.4%

Most occurring scripts

Value	Count	Frequency (%)
Hangul	111	100.0%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
비	3	2.7%
신	3	2.7%
혹	3	2.7%
우	3	2.7%
적	2	1.8%
덜	2	1.8%
단	2	1.8%
악	2	1.8%
진	2	1.8%
지	2	1.8%
Other values (78)	87	78.4%

Most occurring blocks

Value	Count	Frequency (%)
Hangul	111	100.0%

Most frequent character per block

Hangul

Value	Count	Frequency (%)
비	3	2.7%
신	3	2.7%
혹	3	2.7%
우	3	2.7%
적	2	1.8%
덜	2	1.8%
단	2	1.8%
악	2	1.8%
진	2	1.8%
지	2	1.8%
Other values (78)	87	78.4%

Interactions

Scatter plot of 연번 against 연번 (figure not reproduced)

Correlations

	연번	TYPE	긍부정구분	단어
연번	1.000	NaN	NaN	NaN
TYPE	NaN	1.000	0.024	1.000
긍부정구분	NaN	0.024	1.000	1.000
단어	NaN	1.000	1.000	1.000

	긍부정구분	TYPE
긍부정구분	1.000	0.028
TYPE	0.028	1.000

	연번	TYPE	긍부정구분
연번	1.000	1.000	1.000
TYPE	1.000	1.000	0.028
긍부정구분	1.000	0.028	1.000

Missing values

A simple visualization of nullity by column.

The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.

The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.

Sample

First rows

	연번	KEYWORD	WORD	TYPE	긍부정구분	단어
38929	269389	인력부족	<NA>	<NA>	<NA>	<NA>
68667	661996	회차로변	<NA>	<NA>	<NA>	<NA>
47225	363889	좋치않	<NA>	<NA>	<NA>	<NA>
74254	735582	토석채취허	<NA>	<NA>	<NA>	<NA>
19132	81755	개판오분직전	<NA>	<NA>	<NA>	<NA>
74094	733856	방문에정	<NA>	<NA>	<NA>	<NA>
73033	721648	마산조각공원앞	<NA>	<NA>	<NA>	<NA>
68935	666819	연극제	<NA>	<NA>	<NA>	<NA>
38389	264373	창원종합운동장옆	<NA>	<NA>	<NA>	<NA>
76798	792338	끊겼다하	<NA>	<NA>	<NA>	<NA>

Last rows

	연번	KEYWORD	WORD	TYPE	긍부정구분	단어
82973	881097	처리하는점에대하	<NA>	<NA>	<NA>	<NA>
46712	361057	박점숙	<NA>	<NA>	<NA>	<NA>
73809	731290	들껑거리는소음	<NA>	<NA>	<NA>	<NA>
26152	142817	안돌아오고있음	<NA>	<NA>	<NA>	<NA>
39482	274345	문제있는부분	<NA>	<NA>	<NA>	<NA>
31287	199068	어쭙고싶습니다	<NA>	<NA>	<NA>	<NA>
11952	38836	느꼇습니다	<NA>	<NA>	<NA>	<NA>
81578	865780	주차해놓은거	<NA>	<NA>	<NA>	<NA>
41994	307437	한림리치빌	<NA>	<NA>	<NA>	<NA>
94572	1073604	하겠되	<NA>	<NA>	<NA>	<NA>
