gimi9 Pandas Profiling

Dataset statistics

Number of variables	3
Number of observations	454
Missing cells	762
Missing cells (%)	55.9%
Duplicate rows	1
Duplicate rows (%)	0.2%
Total size in memory	11.7 KiB
Average record size in memory	26.3 B
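
The missing-cell totals above can be re-derived by simple counting. The following is a minimal sketch, assuming (as the per-column figures in this report indicate) that each of the 3 columns has the same 254 missing values; the variable names are illustrative only.

```python
# Re-derive the "Missing cells" summary from counts given in this report.
n_rows = 454            # Number of observations
n_cols = 3              # Number of variables
missing_per_col = 254   # Missing count reported for each column

total_cells = n_rows * n_cols
missing_cells = missing_per_col * n_cols
missing_pct = 100 * missing_cells / total_cells

print(missing_cells)            # 762
print(f"{missing_pct:.1f}%")    # 55.9%
```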

Variable types

Numeric	2
Text	1

Dataset

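Description	Ranks and frequencies of the 200 nouns that appeared most often each month in news coverage, by field. Metadata produced by analyzing news from 54 newspapers and broadcasters in the news database "BIGKinds". More information is available at https://www.bigkinds.or.kr.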
Author	한국언론진흥재단 (Korea Press Foundation)
URL	https://www.data.go.kr/data/15065411/fileData.do

Alerts

Dataset has 1 (0.2%) duplicate rows	Duplicates
`순위` is highly overall correlated with `빈도수`	High correlation
`빈도수` is highly overall correlated with `순위`	High correlation
`순위` has 254 (55.9%) missing values	Missing
`키워드` has 254 (55.9%) missing values	Missing
`빈도수` has 254 (55.9%) missing values	Missing

Reproduction

Analysis started	2024-03-14 12:56:21.081445
Analysis finished	2024-03-14 12:56:23.245925
Duration	2.16 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json
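
A report like this one could be regenerated roughly as follows. This is only a sketch under assumptions: the CSV file name, the text encoding, and the helper name `build_report` are hypothetical, and default profiling settings are used rather than the exact options stored in config.json, so the output may differ in detail.

```python
def build_report(csv_path, output_path="report.html", encoding="utf-8"):
    """Profile a CSV file and write an HTML report to output_path.

    csv_path and encoding are assumptions; adjust them to the downloaded file.
    """
    # Imported lazily so that defining this helper does not require
    # pandas or ydata-profiling to be installed.
    import pandas as pd
    from ydata_profiling import ProfileReport

    df = pd.read_csv(csv_path, encoding=encoding)
    profile = ProfileReport(df, title="gimi9 Pandas Profiling")
    profile.to_file(output_path)
    return profile
```

Calling `build_report("keywords.csv")` (a placeholder file name) would write `report.html` to the current directory.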

순위 (rank)
Real number (ℝ)

HIGH CORRELATION MISSING

Distinct	200
Distinct (%)	100.0%
Missing	254
Missing (%)	55.9%
Infinite	0
Infinite (%)	0.0%
Mean	100.5

Minimum	1
Maximum	200
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	4.1 KiB

Quantile statistics

Minimum	1
5-th percentile	10.95
Q1	50.75
median	100.5
Q3	150.25
95-th percentile	190.05
Maximum	200
Range	199
Interquartile range (IQR)	99.5

Descriptive statistics

Standard deviation	57.879185
Coefficient of variation (CV)	0.57591228
Kurtosis	-1.2
Mean	100.5
Median Absolute Deviation (MAD)	50
Skewness	0
Sum	20100
Variance	3350
Monotonicity	Strictly increasing
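
The statistics above are what one would expect if the 200 non-missing values are exactly the integers 1 to 200. The sketch below checks this with the standard library, assuming a sample variance (n - 1 denominator) and linearly interpolated percentiles; both conventions are assumptions that happen to match the reported numbers, and the `percentile` helper is illustrative.

```python
import math
import statistics

ranks = list(range(1, 201))  # assumed: the 200 non-missing values of 순위

def percentile(sorted_values, q):
    """Percentile q in [0, 1] using linear interpolation between neighbors."""
    pos = q * (len(sorted_values) - 1)
    lower = math.floor(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])

mean = statistics.mean(ranks)
variance = statistics.variance(ranks)  # sample variance
std = statistics.stdev(ranks)
cv = std / mean                        # coefficient of variation
q1 = percentile(ranks, 0.25)
q3 = percentile(ranks, 0.75)

print(sum(ranks), mean, variance)
print(round(std, 6), round(cv, 8))
print(q1, q3, q3 - q1)
```

For consecutive integers 1..n the sample variance equals n(n + 1)/12, which gives 3350 for n = 200.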

Histogram with fixed size bins (bins=50)

Value	Count	Frequency (%)
139	1	0.2%
129	1	0.2%
130	1	0.2%
131	1	0.2%
132	1	0.2%
133	1	0.2%
134	1	0.2%
135	1	0.2%
136	1	0.2%
137	1	0.2%
Other values (190)	190	41.9%
(Missing)	254	55.9%

Minimum 10 values

Value	Count	Frequency (%)
1	1	0.2%
2	1	0.2%
3	1	0.2%
4	1	0.2%
5	1	0.2%
6	1	0.2%
7	1	0.2%
8	1	0.2%
9	1	0.2%
10	1	0.2%

Maximum 10 values

Value	Count	Frequency (%)
200	1	0.2%
199	1	0.2%
198	1	0.2%
197	1	0.2%
196	1	0.2%
195	1	0.2%
194	1	0.2%
193	1	0.2%
192	1	0.2%
191	1	0.2%

키워드 (keyword)
Text

MISSING

Distinct	200
Distinct (%)	100.0%
Missing	254
Missing (%)	55.9%
Memory size	3.7 KiB

Length

Max length	8
Median length	3
Mean length	3.54
Min length	2

Characters and Unicode

Total characters	708
Distinct characters	251
Distinct categories	6
Distinct scripts	3
Distinct blocks	2

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
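
As an illustration of how such character properties can be read, the sketch below counts Unicode general categories with Python's standard `unicodedata` module for a handful of 키워드 values taken from this report. It covers categories only; script and block lookups are not available in the standard library and are left out.

```python
import unicodedata
from collections import Counter

# A few 키워드 values that appear in this report.
keywords = ["대통령", "민주당", "이재명", "위원장", "러시아", "장관_후보자"]

# Count the general category of every character.
# NFC normalization keeps each Hangul syllable as a single code point.
category_counts = Counter(
    unicodedata.category(ch)
    for word in keywords
    for ch in unicodedata.normalize("NFC", word)
)

print(category_counts)
```

Here `Lo` corresponds to "Other Letter" (Hangul syllables) and `Pc` to "Connector Punctuation" (the underscore), matching the category names used in the tables of this section.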

Unique

Unique	200
Unique (%)	100.0%

Sample

1st row	대통령
2nd row	민주당
3rd row	이재명
4th row	위원장
5th row	러시아

Value	Count	Frequency (%)
문재인	1	0.5%
홍익표	1	0.5%
선생님	1	0.5%
장관_후보자	1	0.5%
유인촌	1	0.5%
선거구	1	0.5%
중소기업	1	0.5%
해임건의안	1	0.5%
보스토치니	1	0.5%
항공청	1	0.5%
Other values (190)	190	95.0%

Most occurring characters

Value	Count	Frequency (%)
부	19	2.7%
_	17	2.4%
이	16	2.3%
대	16	2.3%
의	14	2.0%
회	14	2.0%
지	13	1.8%
원	12	1.7%
장	12	1.7%
인	11	1.6%
Other values (241)	564	79.7%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	674	95.2%
Connector Punctuation	17	2.4%
Uppercase Letter	10	1.4%
Lowercase Letter	4	0.6%
Decimal Number	2	0.3%
Other Punctuation	1	0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
부	19	2.8%
이	16	2.4%
대	16	2.4%
의	14	2.1%
회	14	2.1%
지	13	1.9%
원	12	1.8%
장	12	1.8%
인	11	1.6%
사	11	1.6%
Other values (225)	536	79.5%

Uppercase Letter

Value	Count	Frequency (%)
S	3	30.0%
R	1	10.0%
G	1	10.0%
D	1	10.0%
O	1	10.0%
C	1	10.0%
P	1	10.0%
N	1	10.0%

Lowercase Letter

Value	Count	Frequency (%)
y	1	25.0%
t	1	25.0%
r	1	25.0%
a	1	25.0%

Decimal Number

Value	Count	Frequency (%)
0	1	50.0%
2	1	50.0%

Connector Punctuation

Value	Count	Frequency (%)
_	17	100.0%

Other Punctuation

Value	Count	Frequency (%)
&	1	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Hangul	674	95.2%
Common	20	2.8%
Latin	14	2.0%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
부	19	2.8%
이	16	2.4%
대	16	2.4%
의	14	2.1%
회	14	2.1%
지	13	1.9%
원	12	1.8%
장	12	1.8%
인	11	1.6%
사	11	1.6%
Other values (225)	536	79.5%

Latin

Value	Count	Frequency (%)
S	3	21.4%
R	1	7.1%
G	1	7.1%
D	1	7.1%
O	1	7.1%
y	1	7.1%
C	1	7.1%
t	1	7.1%
r	1	7.1%
a	1	7.1%
Other values (2)	2	14.3%

Common

Value	Count	Frequency (%)
_	17	85.0%
&	1	5.0%
0	1	5.0%
2	1	5.0%

Most occurring blocks

Value	Count	Frequency (%)
Hangul	674	95.2%
ASCII	34	4.8%

Most frequent character per block

Hangul

Value	Count	Frequency (%)
부	19	2.8%
이	16	2.4%
대	16	2.4%
의	14	2.1%
회	14	2.1%
지	13	1.9%
원	12	1.8%
장	12	1.8%
인	11	1.6%
사	11	1.6%
Other values (225)	536	79.5%

ASCII

Value	Count	Frequency (%)
_	17	50.0%
S	3	8.8%
&	1	2.9%
R	1	2.9%
0	1	2.9%
2	1	2.9%
G	1	2.9%
D	1	2.9%
O	1	2.9%
y	1	2.9%
Other values (6)	6	17.6%

빈도수 (frequency)
Real number (ℝ)

HIGH CORRELATION MISSING

Distinct	159
Distinct (%)	79.5%
Missing	254
Missing (%)	55.9%
Infinite	0
Infinite (%)	0.0%
Mean	608.43

Minimum	151
Maximum	13415
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	4.1 KiB

Quantile statistics

Minimum	151
5-th percentile	154
Q1	189.25
median	297.5
Q3	504
95-th percentile	1345.9
Maximum	13415
Range	13264
Interquartile range (IQR)	314.75

Descriptive statistics

Standard deviation	1308.8554
Coefficient of variation (CV)	2.1512013
Kurtosis	55.326579
Mean	608.43
Median Absolute Deviation (MAD)	127.5
Skewness	6.8581703
Sum	121686
Variance	1713102.5
Monotonicity	Decreasing
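
Several of the figures above are derived from one another, so they can be cross-checked with a few lines of arithmetic. The sketch below uses only values copied from this report; small differences are expected because the inputs are rounded.

```python
# Values copied from the 빈도수 summary in this report.
n = 454 - 254            # non-missing observations
total = 121686           # Sum
std = 1308.8554          # Standard deviation
q1, q3 = 189.25, 504     # Q1 and Q3
minimum, maximum = 151, 13415

mean = total / n                  # reported Mean: 608.43
cv = std / mean                   # reported CV: 2.1512013
variance = std ** 2               # reported Variance: 1713102.5
iqr = q3 - q1                     # reported IQR: 314.75
value_range = maximum - minimum   # reported Range: 13264

print(mean, round(cv, 7), round(variance, 1), iqr, value_range)
```

The mean (608.43) sits far above the median (297.5), which is consistent with the strong right skew reported for this column.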

Histogram with fixed size bins (bins=50)

Value	Count	Frequency (%)
152	4	0.9%
207	3	0.7%
151	3	0.7%
182	3	0.7%
159	3	0.7%
171	3	0.7%
526	3	0.7%
190	2	0.4%
195	2	0.4%
198	2	0.4%
Other values (149)	172	37.9%
(Missing)	254	55.9%

Minimum 10 values

Value	Count	Frequency (%)
151	3	0.7%
152	4	0.9%
153	2	0.4%
154	2	0.4%
159	3	0.7%
161	2	0.4%
162	2	0.4%
163	2	0.4%
165	2	0.4%
166	1	0.2%

Maximum 10 values

Value	Count	Frequency (%)
13415	1	0.2%
8429	1	0.2%
6821	1	0.2%
5449	1	0.2%
4921	1	0.2%
3141	1	0.2%
1984	1	0.2%
1806	1	0.2%
1755	1	0.2%
1515	1	0.2%

[Interactions plot: 순위 vs. 빈도수]
Phik (φk)

	순위	빈도수
순위	1.000	0.352
빈도수	0.352	1.000

Auto

	순위	빈도수
순위	1.000	-1.000
빈도수	-1.000	1.000
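
A coefficient of -1.000 is what a rank-based measure yields when one variable rises while the other never rises. The sketch below assumes the "Auto" setting uses a Spearman-style rank correlation for numeric pairs (an assumption, not confirmed by this report) and applies a hand-written version to the first 10 rows shown in the sample; all function names are illustrative.

```python
def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# First 10 rows of the sample: 순위 and 빈도수.
rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
freq = [13415, 8429, 6821, 5449, 4921, 3141, 1984, 1806, 1755, 1515]

rho = spearman(rank, freq)
print(round(rho, 3))
```

On these rows the linear (Pearson) coefficient is weaker than the rank-based one because 빈도수 falls off steeply rather than linearly.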

A simple visualization of nullity by column.

The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.

The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.

First rows

	순위	키워드	빈도수
0	1	대통령	13415
1	2	민주당	8429
2	3	이재명	6821
3	4	위원장	5449
4	5	러시아	4921
5	6	윤석열	3141
6	7	김정은	1984
7	8	대통령실	1806
8	9	더불어민주당	1755
9	10	본회의	1515

Last rows

	순위	키워드	빈도수
444	<NA>	<NA>	<NA>
445	<NA>	<NA>	<NA>
446	<NA>	<NA>	<NA>
447	<NA>	<NA>	<NA>
448	<NA>	<NA>	<NA>
449	<NA>	<NA>	<NA>
450	<NA>	<NA>	<NA>
451	<NA>	<NA>	<NA>
452	<NA>	<NA>	<NA>
453	<NA>	<NA>	<NA>

Most frequently occurring

	순위	키워드	빈도수	# duplicates
0	<NA>	<NA>	<NA>	254
