[English Text Processing] Vocabulary Learning, Week 1

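Background

I set myself a goal of reading two financial news articles a day, and I have fallen badly behind because I struggle with financial and political vocabulary. While looking up unfamiliar words one by one, I remembered the workflow I used to use for processing text data systematically. Perhaps it could let me deal with the important unknown words first and speed up my understanding of an article. That is how the steps below got started.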

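Current state

Only a single article is used (ignoring, for now, the benefit of statistics across articles)
No word lists are used
No understanding of the article is assumed
Intuitive(?) steps, without looking at how others do it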

Packages used

import re
from collections import Counter

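Process

After pasting the article in directly, the first step is to split the text into a collection of words. English is friendly here because its character set is small, so we can use the negated class in re (`\W`, anything that is not a letter, digit, or underscore) to pick out every such character and use them as the split delimiters (*p1).

```python
syb_set = "".join(list(set(re.findall(r"\W",article))))
```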


<mark>syb_set</mark> comes out as a string like this:

```python
":©. ( ”—‘)“,?]/$[- ’"
```


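However, some of these characters have a special meaning inside a regex character class, so for now they have to be escaped by hand, giving the following pattern (*p2). The newline is written as `\n`, and the string is raw so the backslashes reach re unchanged.

```python
r"[:©.\u2009(\xa0”—‘)“,?\]/$\[\- ’\n]"
```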
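
A minimal sketch of how the manual escaping (*p2) might be automated, assuming `re.escape` is acceptable for building the character class. The helper name `build_split_pattern` and the sample text are made up for illustration; this is not the workflow used in this post.

```python
import re

def build_split_pattern(text):
    """Return a character-class regex matching every non-word character in text."""
    symbols = sorted(set(re.findall(r"\W", text)))
    if not symbols:
        # An empty class "[]" is not a valid pattern, so fall back to whitespace.
        return r"\s"
    # re.escape adds backslashes where needed, e.g. for "[", "]" and "-".
    return "[" + "".join(re.escape(ch) for ch in symbols) + "]"

# Hypothetical sample text standing in for the pasted article.
sample = "Rates [may] rise - or fall?\nNo one knows."
pattern = build_split_pattern(sample)
tokens = re.split(pattern, sample.lower())
# Drop the empty strings left between adjacent delimiters.
words = [t for t in tokens if t]
```

Building the class from the article itself means a new symbol in the next article would not require editing the pattern by hand.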


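At this point the article can be split into a list of words (lowercasing the whole article along the way to make counting easier).

```python
word_raw = re.split(r"[:©.\u2009(\xa0”—‘)“,?\]/$\[\- ’\n]", article.lower())
```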


Passing the list to Counter lets us look at the counts directly:

```python
word_count = Counter(word_raw)
word_count
________output________
Counter({'': 336,
         '5tn': 1,
         'a': 49,
         'abide': 1,
         'able': 1,
         'about': 3,
         'accenture': 1,...
```


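Next, filter out what is not needed: empty strings, tokens containing digits, and stop words. The stop words can be picked by hand from <mark>word_count.keys()</mark> (building my own list (*p3)). The other cases, together with the stop-word list, are handled by conditions when the dictionary is rebuilt.

```python
stop_word = ['and', 'are', 'the', 'have', 'in', 'a', 'but', ...]
# Keep a word only if it has no digits, is not empty, and is not a stop word
cleaner_word_count = Counter({k: v for k, v in word_count.items() if
                              not re.search(r"\d", k) and
                              k != "" and
                              k not in stop_word})
```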


That gives a bare-bones vocabulary with word frequencies.
Use <mark>cleaner_word_count.most_common()</mark> to view the result sorted by frequency.
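
To make the filtering and `most_common` concrete, here is a small run on a made-up sentence. The sample text, the tiny stop-word set, and the simplified delimiter pattern are all assumptions for illustration, not the article or list used in this post.

```python
import re
from collections import Counter

# Hypothetical sample text and stop-word set.
sample = "The bank raised rates. The bank said rates may rise again in 2022."
stop_words = {"the", "in", "may"}

# Simplified delimiters: only a space and a period occur in this sample.
tokens = re.split(r"[ .]", sample.lower())

# Same three conditions as in the text: not empty, no digits, not a stop word.
counts = Counter(t for t in tokens
                 if t and not re.search(r"\d", t) and t not in stop_words)

# Passing a number limits the result to the most frequent entries.
top_two = counts.most_common(2)
```

Here `top_two` holds the two words that occur twice, `bank` and `rates`, each paired with its count.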

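# Results

- A simple view of an article's vocabulary and word frequencies.
- <del>The stress of turning a day's English reading time into writing a toy program</del>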


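# Problems

- p1: Contractions and other words that may contain symbols are split incorrectly, and inflected forms of the same word are counted as different words.
- p2, p3: Not fully automated; bringing in a dictionary and word lists looks necessary.
- <del>Doesn't seem to help much with learning English</del>
- If the goal is learning new words, then besides stop words the output may also need some measure of how familiar each word already is.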
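
For p1, one possible direction is to match words instead of splitting on symbols, so that apostrophes and hyphens inside a word survive. A rough sketch: the pattern, the `tokenize` helper, and the sample sentence are my own assumptions, and it does nothing about inflected forms (that would need a lemmatizer or a dictionary).

```python
import re

# Letters, optionally joined by an apostrophe (straight or curly) or a hyphen.
# The \b anchors skip letter runs glued to digits, such as "5tn".
WORD_RE = re.compile(r"\b[a-z]+(?:['’-][a-z]+)*\b")

def tokenize(text):
    """Return lowercase word tokens, keeping contractions and hyphenated words whole."""
    return WORD_RE.findall(text.lower())

# Hypothetical sample sentence.
sample = "Don't over-think the Fed’s 5tn rate-setting plan."
tokens = tokenize(sample)
```

With this approach there are no empty strings to filter out afterwards, and tokens containing digits never enter the list.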


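# Next steps

- Once enough articles have been collected, start applying statistical methods.
- Resolve the parts that are not automated yet.
- Flesh out how the word lists are built.
- (Long term) Try linking to applications that have <mark>memory-curve data</mark>.
- (Long term) Try building a reading-aid interface for learning English (a <mark>visualization</mark> problem (though everything I have thought of so far is what English teachers tell you not to do, like showing the Chinese directly XD))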
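
Once there are several articles, a first statistical step might be to count how many articles each word appears in, alongside its total frequency. A sketch with made-up sample texts and a deliberately simple tokenizer (both are assumptions, as are the names `term_freq` and `doc_freq`):

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified on purpose: lowercase runs of ASCII letters only.
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical stand-ins for the collected articles.
articles = [
    "Inflation rose. Inflation worries markets.",
    "Markets fell as inflation rose.",
    "The central bank held rates.",
]

term_freq = Counter()  # total occurrences across all articles
doc_freq = Counter()   # number of articles containing the word

for text in articles:
    tokens = tokenize(text)
    term_freq.update(tokens)
    doc_freq.update(set(tokens))  # set() so each article counts a word once

# Words that recur across articles are candidates to learn first.
recurring = sorted(w for w, n in doc_freq.items() if n >= 2)
```

Separating the two counters keeps a word that is repeated many times in one article from outranking a word that shows up in every article.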


# Code snippet

```python
import re
from collections import Counter as ct

# Paste in your own article
article = """...
"""
# This step is not automated yet
#syb_set = "".join(list(set(re.findall(r"\W",article))))
syb_set = r"[:©.\u2009(\xa0”—‘)“,?\]/$\[\- ’\n]"
words_raw = re.split(syb_set, article.lower())
words_count = ct(words_raw)

# You need to build this list yourself
stop_word = []
# Keep a word only if it has no digits, is not empty, and is not a stop word
cleaner_words_count = ct({k: v for k, v in words_count.items() if
                          not re.search(r"\d", k) and
                          k != "" and
                          k not in stop_word})
cleaner_words_count.most_common()
```


<!--
Post ID: 4694
Post Date: 2021-11-14 14:27:57
Categories: 114 - Python String Processing
Tags: 114 - Python String Processing, Python, String Processing, 114
-->
