gimi9 Pandas Profiling

Dataset statistics

Number of variables	3
Number of observations	274
Missing cells	222
Missing cells (%)	27.0%
Duplicate rows	1
Duplicate rows (%)	0.4%
Total size in memory	7.1 KiB
Average record size in memory	26.5 B
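
The percentages in this table follow from simple counts. The sketch below recomputes them, assuming (as the duplicate-rows table later in the report suggests) that the missing cells come from 74 rows that are empty in all three columns. The variable names are illustrative and not part of the report.

```python
# Recompute the headline dataset statistics from raw counts.
# Assumption: 74 rows are empty in every column (inferred from the report).
n_variables = 3
n_observations = 274
n_empty_rows = 74

missing_cells = n_empty_rows * n_variables
missing_cells_pct = 100 * missing_cells / (n_variables * n_observations)

# One distinct row (the all-empty one) occurs more than once.
duplicate_rows = 1
duplicate_rows_pct = 100 * duplicate_rows / n_observations

# 7.1 KiB is itself a rounded figure, so this is only approximate.
total_size_bytes = 7.1 * 1024
avg_record_size = total_size_bytes / n_observations

print(missing_cells, f"{missing_cells_pct:.1f}%")
print(f"{duplicate_rows_pct:.1f}%")
print(f"{avg_record_size:.1f} B")
```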

Variable types

Numeric	2
Text	1

Dataset

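Description	Metadata from an analysis of news from 54 newspapers and broadcasters in the news database "BIGKinds". For each news category, the 200 nouns that appeared most often each month are extracted and provided with their rank and frequency. More information is available at https://www.bigkinds.or.kr.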
Author	Korea Press Foundation (한국언론진흥재단)
URL	https://www.data.go.kr/data/15068899/fileData.do

Alerts

Dataset has 1 (0.4%) duplicate rows	Duplicates
`순위` is highly overall correlated with `빈도수`	High correlation
`빈도수` is highly overall correlated with `순위`	High correlation
`순위` has 74 (27.0%) missing values	Missing
`키워드` has 74 (27.0%) missing values	Missing
`빈도수` has 74 (27.0%) missing values	Missing

Reproduction

Analysis started	2024-03-14 13:57:21.704112
Analysis finished	2024-03-14 13:57:23.315574
Duration	1.61 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json

순위 (rank)
Real number (ℝ)

HIGH CORRELATION, MISSING

Distinct	200
Distinct (%)	100.0%
Missing	74
Missing (%)	27.0%
Infinite	0
Infinite (%)	0.0%
Mean	100.5

Minimum	1
Maximum	200
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	2.5 KiB

Quantile statistics

Minimum	1
5-th percentile	10.95
Q1	50.75
median	100.5
Q3	150.25
95-th percentile	190.05
Maximum	200
Range	199
Interquartile range (IQR)	99.5

Descriptive statistics

Standard deviation	57.879185
Coefficient of variation (CV)	0.57591228
Kurtosis	-1.2
Mean	100.5
Median Absolute Deviation (MAD)	50
Skewness	0
Sum	20100
Variance	3350
Monotonicity	Strictly increasing
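
Because the non-missing values of this column are simply the ranks 1 to 200, the figures above can be checked by hand. The following is a minimal sketch using only the standard library. It assumes linear interpolation for percentiles and the sample (n - 1) variance, which reproduces the reported numbers. The helper `percentile` is hypothetical and not part of ydata-profiling.

```python
import math
import statistics

ranks = list(range(1, 201))  # the 200 non-missing rank values


def percentile(sorted_values, q):
    """Percentile with linear interpolation between the closest ranks."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


mean = statistics.mean(ranks)
variance = statistics.variance(ranks)  # sample variance (n - 1 denominator)
std = statistics.stdev(ranks)
cv = std / mean                        # coefficient of variation
median = statistics.median(ranks)
mad = statistics.median(abs(x - median) for x in ranks)
q1, q3 = percentile(ranks, 0.25), percentile(ranks, 0.75)

print(mean, variance, round(std, 6), round(cv, 8))
print(q1, q3, q3 - q1, mad)
```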

[Histogram with fixed size bins (bins=50)]

Common values

Value	Count	Frequency (%)
139	1	0.4%
129	1	0.4%
130	1	0.4%
131	1	0.4%
132	1	0.4%
133	1	0.4%
134	1	0.4%
135	1	0.4%
136	1	0.4%
137	1	0.4%
Other values (190)	190	69.3%
(Missing)	74	27.0%

Minimum 10 values

Value	Count	Frequency (%)
1	1	0.4%
2	1	0.4%
3	1	0.4%
4	1	0.4%
5	1	0.4%
6	1	0.4%
7	1	0.4%
8	1	0.4%
9	1	0.4%
10	1	0.4%

Maximum 10 values

Value	Count	Frequency (%)
200	1	0.4%
199	1	0.4%
198	1	0.4%
197	1	0.4%
196	1	0.4%
195	1	0.4%
194	1	0.4%
193	1	0.4%
192	1	0.4%
191	1	0.4%

키워드 (keyword)
Text

MISSING

Distinct	200
Distinct (%)	100.0%
Missing	74
Missing (%)	27.0%
Memory size	2.3 KiB

Length

Max length	6
Median length	3
Mean length	3.45
Min length	3

Characters and Unicode

Total characters	690
Distinct characters	240
Distinct categories	5
Distinct scripts	3
Distinct blocks	2

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
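
As an illustration of how such per-character properties can be tallied, the sketch below counts Unicode general categories with Python's `unicodedata` module. The helper `category_counts` and the sample list are hypothetical. The first three strings are keywords from this report, while `"LNG"` and `"R&D"` are made-up examples of Latin and punctuation characters.

```python
import unicodedata
from collections import Counter

# Readable names for the general-category codes that appear in this report.
CATEGORY_NAMES = {
    "Lo": "Other Letter",
    "Lu": "Uppercase Letter",
    "Nd": "Decimal Number",
    "Pc": "Connector Punctuation",
    "Po": "Other Punctuation",
}


def category_counts(texts):
    """Count characters per Unicode general category across all strings."""
    counts = Counter()
    for text in texts:
        for ch in text:
            code = unicodedata.category(ch)
            counts[CATEGORY_NAMES.get(code, code)] += 1
    return counts


# Illustrative sample only; not the full 키워드 column.
sample = ["서비스", "글로벌", "반도체", "LNG", "R&D"]
counts = category_counts(sample)
for name, count in counts.most_common():
    print(name, count)
```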

Unique

Unique	200
Unique (%)	100.0%

Sample

1st row	서비스
2nd row	글로벌
3rd row	반도체
4th row	에너지
5th row	소비자

Value	Count	Frequency (%)
디지털	1	0.5%
연체율	1	0.5%
운반선	1	0.5%
금융위	1	0.5%
우크라이나	1	0.5%
환경부	1	0.5%
lng	1	0.5%
오염수	1	0.5%
하나은행	1	0.5%
관광객	1	0.5%
Other values (190)	190	95.0%

Most occurring characters

Value	Count	Frequency (%)
스	21	3.0%
이	20	2.9%
자	19	2.8%
트	11	1.6%
아	10	1.4%
산	9	1.3%
리	8	1.2%
전	8	1.2%
사	8	1.2%
플	8	1.2%
Other values (230)	568	82.3%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	625	90.6%
Uppercase Letter	60	8.7%
Connector Punctuation	2	0.3%
Decimal Number	2	0.3%
Other Punctuation	1	0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
스	21	3.4%
이	20	3.2%
자	19	3.0%
트	11	1.8%
아	10	1.6%
산	9	1.4%
리	8	1.3%
전	8	1.3%
사	8	1.3%
플	8	1.3%
Other values (205)	503	80.5%

Uppercase Letter

Value	Count	Frequency (%)
G	5	8.3%
S	5	8.3%
O	5	8.3%
M	4	6.7%
I	4	6.7%
D	4	6.7%
C	3	5.0%
B	3	5.0%
P	3	5.0%
L	3	5.0%
Other values (11)	21	35.0%

Decimal Number

Value	Count	Frequency (%)
9	1	50.0%
1	1	50.0%

Connector Punctuation

Value	Count	Frequency (%)
_	2	100.0%

Other Punctuation

Value	Count	Frequency (%)
&	1	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Hangul	625	90.6%
Latin	60	8.7%
Common	5	0.7%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
스	21	3.4%
이	20	3.2%
자	19	3.0%
트	11	1.8%
아	10	1.6%
산	9	1.4%
리	8	1.3%
전	8	1.3%
사	8	1.3%
플	8	1.3%
Other values (205)	503	80.5%

Latin

Value	Count	Frequency (%)
G	5	8.3%
S	5	8.3%
O	5	8.3%
M	4	6.7%
I	4	6.7%
D	4	6.7%
C	3	5.0%
B	3	5.0%
P	3	5.0%
L	3	5.0%
Other values (11)	21	35.0%

Common

Value	Count	Frequency (%)
_	2	40.0%
9	1	20.0%
1	1	20.0%
&	1	20.0%

Most occurring blocks

Value	Count	Frequency (%)
Hangul	625	90.6%
ASCII	65	9.4%

Most frequent character per block

Hangul

Value	Count	Frequency (%)
스	21	3.4%
이	20	3.2%
자	19	3.0%
트	11	1.8%
아	10	1.6%
산	9	1.4%
리	8	1.3%
전	8	1.3%
사	8	1.3%
플	8	1.3%
Other values (205)	503	80.5%

ASCII

Value	Count	Frequency (%)
G	5	7.7%
S	5	7.7%
O	5	7.7%
M	4	6.2%
I	4	6.2%
D	4	6.2%
C	3	4.6%
B	3	4.6%
P	3	4.6%
L	3	4.6%
Other values (15)	26	40.0%

빈도수 (frequency)
Real number (ℝ)

HIGH CORRELATION, MISSING

Distinct	165
Distinct (%)	82.5%
Missing	74
Missing (%)	27.0%
Infinite	0
Infinite (%)	0.0%
Mean	614.07

Minimum	219
Maximum	6103
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	2.5 KiB

Quantile statistics

Minimum	219
5-th percentile	225.95
Q1	267
median	349.5
Q3	535
95-th percentile	1990.9
Maximum	6103
Range	5884
Interquartile range (IQR)	268

Descriptive statistics

Standard deviation	788.6471
Coefficient of variation (CV)	1.2842951
Kurtosis	20.775606
Mean	614.07
Median Absolute Deviation (MAD)	95
Skewness	4.1426905
Sum	122814
Variance	621964.25
Monotonicity	Decreasing

[Histogram with fixed size bins (bins=50)]

Common values

Value	Count	Frequency (%)
265	5	1.8%
224	4	1.5%
271	3	1.1%
232	3	1.1%
257	2	0.7%
317	2	0.7%
260	2	0.7%
261	2	0.7%
751	2	0.7%
309	2	0.7%
Other values (155)	173	63.1%
(Missing)	74	27.0%

Minimum 10 values

Value	Count	Frequency (%)
219	1	0.4%
220	2	0.7%
221	1	0.4%
224	4	1.5%
225	2	0.7%
226	1	0.4%
228	2	0.7%
229	1	0.4%
230	1	0.4%
231	1	0.4%

Maximum 10 values

Value	Count	Frequency (%)
6103	1	0.4%
5465	1	0.4%
3906	1	0.4%
3517	1	0.4%
3022	1	0.4%
2911	1	0.4%
2892	1	0.4%
2424	1	0.4%
2347	1	0.4%
2179	1	0.4%

Interactions

[Scatter plot of 순위 against 빈도수]

Correlations

Phik (φk)

	순위	빈도수
순위	1.000	0.625
빈도수	0.625	1.000

Auto

	순위	빈도수
순위	1.000	-1.000
빈도수	-1.000	1.000
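
A coefficient of -1.000 is what a rank correlation gives when one column falls whenever the other rises. The sketch below computes Spearman's rank correlation on the first ten rows of the sample. It assumes the Auto setting uses a rank-based measure for numeric pairs, which the report itself does not state. The functions `average_ranks`, `pearson`, and `spearman` are illustrative helpers.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# First ten rows of the sample: 순위 and 빈도수.
rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
freq = [6103, 5465, 3906, 3517, 3022, 2911, 2892, 2424, 2347, 2179]
rho = spearman(rank, freq)
print(round(rho, 3))
```

On the full column, tied frequencies would move the coefficient slightly away from exactly -1, which is consistent with a value that rounds to -1.000.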

Missing values

A simple visualization of nullity by column.

The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.

The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.

Sample

First rows

	순위	키워드	빈도수
0	1	서비스	6103
1	2	글로벌	5465
2	3	반도체	3906
3	4	에너지	3517
4	5	소비자	3022
5	6	부동산	2911
6	7	전기차	2892
7	8	아파트	2424
8	9	자동차	2347
9	10	투자자	2179

Last rows

	순위	키워드	빈도수
264	<NA>	<NA>	<NA>
265	<NA>	<NA>	<NA>
266	<NA>	<NA>	<NA>
267	<NA>	<NA>	<NA>
268	<NA>	<NA>	<NA>
269	<NA>	<NA>	<NA>
270	<NA>	<NA>	<NA>
271	<NA>	<NA>	<NA>
272	<NA>	<NA>	<NA>
273	<NA>	<NA>	<NA>

Duplicate rows

Most frequently occurring

	순위	키워드	빈도수	# duplicates
0	<NA>	<NA>	<NA>	74
