Overview

Dataset statistics

Number of variables: 3
Number of observations: 10000
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 374
Duplicate rows (%): 3.7%
Total size in memory: 312.5 KiB
Average record size in memory: 32.0 B
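The two derived figures above follow from the raw counts. The sketch below is a small arithmetic check that uses only numbers reported in this section; the variable names are illustrative and are not part of the report.

```python
# Figures copied from the "Dataset statistics" block of this report.
n_observations = 10_000
n_duplicate_rows = 374
total_size_kib = 312.5

# Share of duplicate rows, as a percentage of all observations.
duplicate_pct = n_duplicate_rows / n_observations * 100

# Average record size: total memory (KiB converted to bytes) divided by the row count.
avg_record_bytes = total_size_kib * 1024 / n_observations

print(f"Duplicate rows (%): {duplicate_pct:.1f}%")
print(f"Average record size in memory: {avg_record_bytes:.1f} B")
```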

Variable types

Text: 2
Categorical: 1

Dataset

Description: Data on the items of interest registered by KnowTBT portal members. Items are classified by major category and subcategory, and registrant IDs are de-identified.
URL: https://www.data.go.kr/data/15068829/fileData.do

Alerts

Duplicates: Dataset has 374 (3.7%) duplicate rows

Reproduction

Analysis started: 2023-12-12 04:32:51.922567
Analysis finished: 2023-12-12 04:32:52.332976
Duration: 0.41 seconds
Software version: ydata-profiling v4.5.1
Download configuration: config.json

Variables

중분류명 (middle category name)
Text

Distinct: 65
Distinct (%): 0.7%
Missing: 0
Missing (%): 0.0%
Memory size: 156.2 KiB

Length

Max length: 16
Median length: 12
Mean length: 7.2574
Min length: 2
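The reported mean length is consistent with the total character count divided by the number of observations (72574 / 10000 = 7.2574). As a minimal sketch of how such length statistics can be computed, the example below applies the Python standard library to the five sample values shown for this variable; it is not the profiler's own code, and five rows will not reproduce the full-dataset figures.

```python
import statistics
import unicodedata

# The five sample values shown for this variable in the report.
samples = [
    "컴퓨터시스템",
    "플라스틱 소재 및 제품",
    "정보기반, 정보보안",
    "섬유소재 및 제품",
    "석유화학제품",
]

# Normalize to NFC so that each Hangul syllable counts as one code point.
lengths = [len(unicodedata.normalize("NFC", s)) for s in samples]

length_stats = {
    "max": max(lengths),
    "median": statistics.median(lengths),
    "mean": statistics.mean(lengths),
    "min": min(lengths),
}
print(length_stats)
```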

Characters and Unicode

Total characters: 72574
Distinct characters: 145
Distinct categories: 7
Distinct scripts: 3
Distinct blocks: 2
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
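As an illustration of these character properties, the sketch below uses Python's standard `unicodedata` module to count the general category of each code point in one value taken from the sample rows of this report. The two-letter codes correspond to the category names used in the tables, for example `Lo` for Other Letter and `Zs` for Space Separator. This is only a sketch of the idea, not the profiler's implementation.

```python
import unicodedata
from collections import Counter

# One value taken from the sample rows of this report, normalized to NFC.
text = unicodedata.normalize("NFC", "기계요소부품(나사, 볼트 등)")

# Two-letter Unicode general category for every code point in the value.
category_counts = Counter(unicodedata.category(ch) for ch in text)

for category, count in category_counts.most_common():
    print(category, count)
```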

Unique

Unique: 11
Unique (%): 0.1%
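"Distinct" and "Unique" are different counts. The sketch below assumes that Distinct is the number of different values and that Unique is the number of values occurring exactly once; this reading is an assumption about the report's definitions, and the toy values are illustrative rather than taken from the dataset.

```python
from collections import Counter

# Toy column; the values are illustrative only.
values = ["기계", "건설", "기계", "에너지", "건설", "수산물"]

counts = Counter(values)

# Distinct: how many different values appear at all.
n_distinct = len(counts)

# Unique (assumed definition): how many values appear exactly once.
n_unique = sum(1 for c in counts.values() if c == 1)

print(f"Distinct: {n_distinct}, Unique: {n_unique}")
```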

Sample

1st row: 컴퓨터시스템
2nd row: 플라스틱 소재 및 제품
3rd row: 정보기반, 정보보안
4th row: 섬유소재 및 제품
5th row: 석유화학제품
Value | Count | Frequency (%)
(value missing) | 2070 | 10.8%
소재 | 718 | 3.7%
제품 | 588 | 3.1%
전자파 | 419 | 2.2%
가전기기 | 392 | 2.0%
환경기술 | 385 | 2.0%
환경자원 | 385 | 2.0%
유무선통신 | 366 | 1.9%
비금속 | 363 | 1.9%
금속 | 363 | 1.9%
Other values (89) | 13133 | 68.5%
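The table above counts words rather than whole values. A minimal sketch of such a word count follows, assuming tokens are runs of word characters with spaces and punctuation dropped; the report does not state its exact tokenization rule, so this is an assumption. The input is the five sample rows shown above, so the counts will not match the full-dataset table.

```python
import re
from collections import Counter

# The five sample values shown for this variable in the report.
samples = [
    "컴퓨터시스템",
    "플라스틱 소재 및 제품",
    "정보기반, 정보보안",
    "섬유소재 및 제품",
    "석유화학제품",
]

# \w+ keeps runs of letters and digits and drops spaces and punctuation such as ",".
tokens = [tok for s in samples for tok in re.findall(r"\w+", s)]
word_counts = Counter(tokens)
total_words = sum(word_counts.values())

for word, count in word_counts.most_common(3):
    print(f"{word} | {count} | {count / total_words:.1%}")
```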

Most occurring characters

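Value | Count | Frequency (%)
(space) | 9750 | 13.4%
(value missing) | 4886 | 6.7%
, | 3276 | 4.5%
(value missing) | 2512 | 3.5%
(value missing) | 2070 | 2.9%
(value missing) | 1605 | 2.2%
(value missing) | 1485 | 2.0%
(value missing) | 1290 | 1.8%
(value missing) | 1235 | 1.7%
(value missing) | 1225 | 1.7%
Other values (135) | 43240 | 59.6%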

Most occurring categories

Value | Count | Frequency (%)
Other Letter | 58770 | 81.0%
Space Separator | 9750 | 13.4%
Other Punctuation | 3578 | 4.9%
Open Punctuation | 215 | 0.3%
Close Punctuation | 215 | 0.3%
Uppercase Letter | 41 | 0.1%
Lowercase Letter | 5 | < 0.1%

Most frequent character per category

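Other Letter
Value | Count | Frequency (%)
(value missing) | 4886 | 8.3%
(value missing) | 2512 | 4.3%
(value missing) | 2070 | 3.5%
(value missing) | 1605 | 2.7%
(value missing) | 1485 | 2.5%
(value missing) | 1290 | 2.2%
(value missing) | 1235 | 2.1%
(value missing) | 1225 | 2.1%
(value missing) | 1190 | 2.0%
(value missing) | 1186 | 2.0%
Other values (120) | 40086 | 68.2%

Uppercase Letter
Value | Count | Frequency (%)
E | 15 | 36.6%
H | 6 | 14.6%
W | 5 | 12.2%
R | 5 | 12.2%
S | 5 | 12.2%
G | 2 | 4.9%
C | 1 | 2.4%
O | 1 | 2.4%
V | 1 | 2.4%

Other Punctuation
Value | Count | Frequency (%)
, | 3276 | 91.6%
/ | 302 | 8.4%

Space Separator
Value | Count | Frequency (%)
(space) | 9750 | 100.0%

Open Punctuation
Value | Count | Frequency (%)
( | 215 | 100.0%

Close Punctuation
Value | Count | Frequency (%)
) | 215 | 100.0%

Lowercase Letter
Value | Count | Frequency (%)
o | 5 | 100.0%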

Most occurring scripts

Value | Count | Frequency (%)
Hangul | 58770 | 81.0%
Common | 13758 | 19.0%
Latin | 46 | 0.1%

Most frequent character per script

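Hangul
Value | Count | Frequency (%)
(value missing) | 4886 | 8.3%
(value missing) | 2512 | 4.3%
(value missing) | 2070 | 3.5%
(value missing) | 1605 | 2.7%
(value missing) | 1485 | 2.5%
(value missing) | 1290 | 2.2%
(value missing) | 1235 | 2.1%
(value missing) | 1225 | 2.1%
(value missing) | 1190 | 2.0%
(value missing) | 1186 | 2.0%
Other values (120) | 40086 | 68.2%

Latin
Value | Count | Frequency (%)
E | 15 | 32.6%
H | 6 | 13.0%
W | 5 | 10.9%
R | 5 | 10.9%
S | 5 | 10.9%
o | 5 | 10.9%
G | 2 | 4.3%
C | 1 | 2.2%
O | 1 | 2.2%
V | 1 | 2.2%

Common
Value | Count | Frequency (%)
(space) | 9750 | 70.9%
, | 3276 | 23.8%
/ | 302 | 2.2%
( | 215 | 1.6%
) | 215 | 1.6%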

Most occurring blocks

Value | Count | Frequency (%)
Hangul | 58770 | 81.0%
ASCII | 13804 | 19.0%

Most frequent character per block

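ASCII
Value | Count | Frequency (%)
(space) | 9750 | 70.6%
, | 3276 | 23.7%
/ | 302 | 2.2%
( | 215 | 1.6%
) | 215 | 1.6%
E | 15 | 0.1%
H | 6 | < 0.1%
W | 5 | < 0.1%
R | 5 | < 0.1%
S | 5 | < 0.1%
Other values (5) | 10 | 0.1%

Hangul
Value | Count | Frequency (%)
(value missing) | 4886 | 8.3%
(value missing) | 2512 | 4.3%
(value missing) | 2070 | 3.5%
(value missing) | 1605 | 2.7%
(value missing) | 1485 | 2.5%
(value missing) | 1290 | 2.2%
(value missing) | 1235 | 2.1%
(value missing) | 1225 | 2.1%
(value missing) | 1190 | 2.0%
(value missing) | 1186 | 2.0%
Other values (120) | 40086 | 68.2%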

대분류명 (major category name)
Categorical

Distinct: 13
Distinct (%): 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 156.2 KiB

Value | Count
기계 | 1178
화학세라믹 | 1076
바이오환경 | 964
소재나노 | 949
에너지 | 948
Other values (8) | 4885

Length

Max length: 5
Median length: 4
Mean length: 3.891
Min length: 2

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: 정보디지털
2nd row: 화학세라믹
3rd row: 정보디지털
4th row: 생활용품
5th row: 화학세라믹

Common Values

Value | Count | Frequency (%)
기계 | 1178 | 11.8%
화학세라믹 | 1076 | 10.8%
바이오환경 | 964 | 9.6%
소재나노 | 949 | 9.5%
에너지 | 948 | 9.5%
정보디지털 | 937 | 9.4%
전기전자 | 878 | 8.8%
생활용품 | 689 | 6.9%
건설 | 688 | 6.9%
교통/안전 | 617 | 6.2%
Other values (3) | 1076 | 10.8%

Length

Figure: Histogram of lengths of the category

등록자ID (registrant ID)
Text

Distinct: 3347
Distinct (%): 33.5%
Missing: 0
Missing (%): 0.0%
Memory size: 156.2 KiB

Length

Max length: 26
Median length: 23
Mean length: 7.4287
Min length: 5

Characters and Unicode

Total characters: 74287
Distinct characters: 65
Distinct categories: 6
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 1401
Unique (%): 14.0%

Sample

1st row: L****care
2nd row: s****k77
3rd row: l****1969
4th row: e****a
5th row: d****9
Value | Count | Frequency (%)
k | 131 | 1.3%
s | 117 | 1.2%
j | 88 | 0.9%
h | 57 | 0.6%
p | 52 | 0.5%
d | 42 | 0.4%
c | 38 | 0.4%
e | 37 | 0.4%
l | 37 | 0.4%
m | 37 | 0.4%
Other values (3311) | 9364 | 93.6%

Most occurring characters

Value | Count | Frequency (%)
* | 40000 | 53.8%
0 | 1948 | 2.6%
k | 1871 | 2.5%
s | 1837 | 2.5%
1 | 1756 | 2.4%
n | 1724 | 2.3%
e | 1481 | 2.0%
a | 1421 | 1.9%
2 | 1385 | 1.9%
o | 1313 | 1.8%
Other values (55) | 19551 | 26.3%

Most occurring categories

Value | Count | Frequency (%)
Other Punctuation | 40189 | 54.1%
Lowercase Letter | 22771 | 30.7%
Decimal Number | 11038 | 14.9%
Uppercase Letter | 281 | 0.4%
Connector Punctuation | 6 | < 0.1%
Dash Punctuation | 2 | < 0.1%

Most frequent character per category

Lowercase Letter
Value | Count | Frequency (%)
k | 1871 | 8.2%
s | 1837 | 8.1%
n | 1724 | 7.6%
e | 1481 | 6.5%
a | 1421 | 6.2%
o | 1313 | 5.8%
h | 1091 | 4.8%
g | 1047 | 4.6%
j | 1027 | 4.5%
i | 1005 | 4.4%
Other values (16) | 8954 | 39.3%

Uppercase Letter
Value | Count | Frequency (%)
C | 26 | 9.3%
K | 25 | 8.9%
G | 23 | 8.2%
O | 19 | 6.8%
L | 19 | 6.8%
S | 15 | 5.3%
A | 15 | 5.3%
W | 14 | 5.0%
D | 13 | 4.6%
F | 13 | 4.6%
Other values (14) | 99 | 35.2%

Decimal Number
Value | Count | Frequency (%)
0 | 1948 | 17.6%
1 | 1756 | 15.9%
2 | 1385 | 12.5%
7 | 1243 | 11.3%
9 | 897 | 8.1%
3 | 863 | 7.8%
5 | 760 | 6.9%
8 | 751 | 6.8%
4 | 733 | 6.6%
6 | 702 | 6.4%

Other Punctuation
Value | Count | Frequency (%)
* | 40000 | 99.5%
. | 128 | 0.3%
@ | 61 | 0.2%

Connector Punctuation
Value | Count | Frequency (%)
_ | 6 | 100.0%

Dash Punctuation
Value | Count | Frequency (%)
- | 2 | 100.0%
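The 40000 asterisks across 10000 IDs average four per ID, which is consistent with a fixed four-character mask. The sketch below checks that reading against the masked IDs visible in this report. The pattern (one visible character, exactly four asterisks, then an optional visible tail) is a hypothetical description inferred from those samples, not a documented masking rule.

```python
import re

# Masked IDs as they appear in the sample rows of this report.
sample_ids = [
    "L****care",
    "s****k77",
    "l****1969",
    "e****a",
    "d****9",
    "k****@komma.org",
    "j****",
]

# Hypothetical pattern: one visible character, four asterisks, optional visible tail.
MASK_PATTERN = re.compile(r"[^*]\*{4}[^*]*")

all_match = all(MASK_PATTERN.fullmatch(masked_id) for masked_id in sample_ids)
asterisks_per_id = [masked_id.count("*") for masked_id in sample_ids]

print(all_match, asterisks_per_id)
```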

Most occurring scripts

Value | Count | Frequency (%)
Common | 51235 | 69.0%
Latin | 23052 | 31.0%

Most frequent character per script

Latin
Value | Count | Frequency (%)
k | 1871 | 8.1%
s | 1837 | 8.0%
n | 1724 | 7.5%
e | 1481 | 6.4%
a | 1421 | 6.2%
o | 1313 | 5.7%
h | 1091 | 4.7%
g | 1047 | 4.5%
j | 1027 | 4.5%
i | 1005 | 4.4%
Other values (40) | 9235 | 40.1%

Common
Value | Count | Frequency (%)
* | 40000 | 78.1%
0 | 1948 | 3.8%
1 | 1756 | 3.4%
2 | 1385 | 2.7%
7 | 1243 | 2.4%
9 | 897 | 1.8%
3 | 863 | 1.7%
5 | 760 | 1.5%
8 | 751 | 1.5%
4 | 733 | 1.4%
Other values (5) | 899 | 1.8%

Most occurring blocks

Value | Count | Frequency (%)
ASCII | 74287 | 100.0%

Most frequent character per block

ASCII
Value | Count | Frequency (%)
* | 40000 | 53.8%
0 | 1948 | 2.6%
k | 1871 | 2.5%
s | 1837 | 2.5%
1 | 1756 | 2.4%
n | 1724 | 2.3%
e | 1481 | 2.0%
a | 1421 | 1.9%
2 | 1385 | 1.9%
o | 1313 | 1.8%
Other values (55) | 19551 | 26.3%

Correlations

Variable | 중분류명 | 대분류명
중분류명 | 1.000 | 1.000
대분류명 | 1.000 | 1.000
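A value of 1.000 between the two category columns is what one would expect if every middle category belongs to exactly one major category. Cramér's V is one standard association measure for two categorical columns. Whether the report computed exactly this statistic is an assumption, and the sketch below uses the plain, uncorrected formula on toy pairs built from values that appear in this report.

```python
import math
from collections import Counter


def cramers_v(pairs):
    """Uncorrected Cramér's V for a list of (x, y) category pairs."""
    n = len(pairs)
    x_totals = Counter(x for x, _ in pairs)
    y_totals = Counter(y for _, y in pairs)
    observed = Counter(pairs)

    # Chi-squared statistic over the full contingency table, empty cells included.
    chi2 = 0.0
    for x, x_n in x_totals.items():
        for y, y_n in y_totals.items():
            expected = x_n * y_n / n
            chi2 += (observed[(x, y)] - expected) ** 2 / expected

    k = min(len(x_totals), len(y_totals)) - 1
    if k == 0:
        return 0.0  # A column with a single level carries no association.
    return math.sqrt(chi2 / (n * k))


# Toy pairs (중분류명, 대분류명): each middle category maps to one major category.
nested_pairs = (
    [("컴퓨터시스템", "정보디지털")] * 2
    + [("정보기반, 정보보안", "정보디지털")]
    + [("자동차", "교통/안전")] * 3
)
nested_v = cramers_v(nested_pairs)
print(round(nested_v, 3))
```

For contrast, pairs in which every major category is equally likely under every middle category give a value of 0.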

Missing values

A simple visualization of nullity by column.
Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

Sample

Index | 중분류명 | 대분류명 | 등록자ID
35512 | 컴퓨터시스템 | 정보디지털 | L****care
50624 | 플라스틱 소재 및 제품 | 화학세라믹 | s****k77
37152 | 정보기반, 정보보안 | 정보디지털 | l****1969
12766 | 섬유소재 및 제품 | 생활용품 | e****a
9359 | 석유화학제품 | 화학세라믹 | d****9
28791 | 기계요소부품(나사, 볼트 등) | 소재나노 | k****@komma.org
15772 | 가스기기 및 가스용기 | 에너지 | g****09
27422 | 피혁 및 신발류 | 생활용품 | j****
46970 | 생물자원, 천영물 원재료 | 바이오환경 | r****2001
53160 | 자동차 | 교통/안전 | s****

Index | 중분류명 | 대분류명 | 등록자ID
25739 | 정보기반, 정보보안 | 정보디지털 | j****ox
32196 | 컴퓨터시스템 | 정보디지털 | k****ki
20872 | 자동차 | 교통/안전 | h****gjoon
17063 | 기계요소부품(나사, 볼트 등) | 소재나노 | h****su
59371 | 건축환경 | 건설 | w****e
35141 | 자동차 | 교통/안전 | l****0153
22944 | 금속, 비금속 소재 | 소재나노 | j****635
9534 | 자동차 | 교통/안전 | d****sk
52657 | 석유화학제품 | 화학세라믹 | s****77
55747 | 수산물 | 농수산품 | t****2

Duplicate rows

Most frequently occurring

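Index | 중분류명 | 대분류명 | 등록자ID | # duplicates
334 | 플라스틱 소재 및 제품 | 화학세라믹 | s**** | 10
56 | 공작기계 | 기계 | s**** | 7
70 | 금속, 비금속 소재 | 소재나노 | s**** | 7
244 | 일반 | 건설 | s**** | 7
372 | 환경자원, 환경기술 | 바이오환경 | s**** | 7
35 | 건축환경 | 건설 | k**** | 6
163 | 생물자원, 천영물 원재료 | 바이오환경 | k**** | 6
188 | 승강기 및 부품 | 기계 | k**** | 6
272 | 전자파 | 전기전자 | k**** | 6
10 | 가스기기 및 가스용기 | 에너지 | s**** | 5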
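A table of this shape can be built by grouping identical rows and counting each group. The sketch below does this with the standard library on a handful of toy rows written in the style of the report. The rows and their counts are illustrative, and the report's own counting code may differ.

```python
from collections import Counter

# Toy rows (중분류명, 대분류명, 등록자ID); illustrative, not the real data.
rows = [
    ("공작기계", "기계", "s****"),
    ("건축환경", "건설", "k****"),
    ("공작기계", "기계", "s****"),
    ("자동차", "교통/안전", "d****sk"),
    ("공작기계", "기계", "s****"),
    ("건축환경", "건설", "k****"),
]

row_counts = Counter(rows)

# Keep only rows that occur more than once, most frequent first.
duplicates = [(row, n) for row, n in row_counts.most_common() if n > 1]

for row, n in duplicates:
    print(" | ".join(row), "|", n)
```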