
Notebook for data cleaning

Reference: https://www.justintodata.com/data-cleaning-techniques-python-guide/#what-is-data-cleaning-and-why-is-it-important

Data Cleaning: Data cleansing (or data cleaning) is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

    Methods & techniques in Python for finding and cleaning:
  • Missing data
  • Irregular data (outliers)
  • Unnecessary data: repetitive values, duplicates, and more
  • Inconsistent data: capitalization, data types, typos, addresses

  Raw data is almost always messy and may suffer from various quality issues, so we can't use it as it is. If you use such data for analysis, for example by feeding it into a machine learning model, you'll get useless insights most of the time. That's why data cleansing is a critical process for data analysts and data scientists.
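  As a quick illustration (a minimal sketch with made-up values, separate from the case study below), each of these issues can be spotted in a tiny DataFrame:

      import pandas as pd
      import numpy as np

      # toy (hypothetical) data showing the four kinds of dirty data listed above
      toy = pd.DataFrame({
          'city':  ['Toronto', 'toronto', 'Tronto', 'Montreal', 'Montreal'],   # inconsistent case + a typo
          'sqft':  [650, 720, np.nan, 680, 680],                               # a missing value
          'price': [500_000, 520_000, 480_000, 9_000_000, 450_000],            # a possible outlier
      })
      toy = pd.concat([toy, toy.iloc[[4]]], ignore_index=True)                 # add a duplicate row

      print(toy.isna().sum())                         # missing data per column
      print(toy.duplicated().sum())                   # number of duplicate rows
      print(toy['city'].str.lower().value_counts())   # case-normalised categories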
      Useful Python Libraries
    • pandas: a popular data analysis and manipulation tool, which will be used for most of our data cleaning techniques
    • seaborn: a statistical data visualization library
    • missingno: a Python library that provides a series of visualisations for understanding the presence and distribution of missing data within a pandas DataFrame
    • nltk: natural language toolkit

      Case Study --> Russian housing market dataset. The goal of the project is to predict housing prices.
      import pandas as pd
      df = pd.read_csv("sberbank-russian-housing-market/train/train.csv")
      df.head()
      df.shape
      
      (30471, 292)

      There are 292 columns and 30471 rows in the dataset.
      # to check datatypes 
      df.info()
      
      RangeIndex: 30471 entries, 0 to 30470
      Columns: 292 entries, id to price_doc
      dtypes: float64(119), int64(157), object(16)
      memory usage: 67.9+ MB
      # to identify numeric and non-numeric attributes in the dataset
      numeric_cols = df.select_dtypes(include=['number']).columns
      non_numeric_cols = df.select_dtypes(exclude=['number']).columns
      
      numeric_cols
      
      Index(['id', 'full_sq', 'life_sq', 'floor', 'max_floor', 'material', 'build_year', 'num_room', 'kitch_sq', 'state', ... 'cafe_count_5000_price_2500', 'cafe_count_5000_price_4000', 'cafe_count_5000_price_high', 'big_church_count_5000', 'church_count_5000', 'mosque_count_5000', 'leisure_count_5000', 'sport_count_5000', 'market_count_5000', 'price_doc'], dtype='object', length=276)
      non_numeric_cols
      
      Index(['timestamp', 'product_type', 'sub_area', 'culture_objects_top_25', 'thermal_power_plant_raion', 'incineration_raion', 'oil_chemistry_raion', 'radiation_raion', 'railroad_terminal_raion', 'big_market_raion', 'nuclear_reactor_raion', 'detention_facility_raion', 'water_1line', 'big_road1_1line', 'railroad_1line', 'ecology'], dtype='object')

# Missing Data
        Missing data occurs when no value is stored for a column in a row. First we find it, then we decide how to handle it.
      • How to find out?
      • Method #1: missing data (by columns) count & percentage
        df[non_numeric_cols].info()
        
        RangeIndex: 30471 entries, 0 to 30470
        Data columns (total 16 columns):
         #   Column                     Non-Null Count  Dtype
        ---  ------                     --------------  -----
         0   timestamp                  30471 non-null  object
         1   product_type               30471 non-null  object
         2   sub_area                   30471 non-null  object
         3   culture_objects_top_25     30471 non-null  object
         4   thermal_power_plant_raion  30471 non-null  object
         5   incineration_raion         30471 non-null  object
         6   oil_chemistry_raion        30471 non-null  object
         7   radiation_raion            30471 non-null  object
         8   railroad_terminal_raion    30471 non-null  object
         9   big_market_raion           30471 non-null  object
         10  nuclear_reactor_raion      30471 non-null  object
         11  detention_facility_raion   30471 non-null  object
         12  water_1line                30471 non-null  object
         13  big_road1_1line            30471 non-null  object
         14  railroad_1line             30471 non-null  object
         15  ecology                    30471 non-null  object
        dtypes: object(16)
        memory usage: 3.7+ MB

        All non-null counts equal the number of rows, hence there is no missing data in the non-numeric columns.
        num_missing = df.isna().sum()
        num_missing[:10]
        
        id               0
        timestamp        0
        full_sq          0
        life_sq       6383
        floor          167
        max_floor     9572
        material      9572
        build_year   13605
        num_room      9572
        kitch_sq      9572
        dtype: int64
        # to calculate the fraction (percentage) of missing values per column
        num_missing = df.isna().mean()
        num_missing[:10]
        
        id            0.000000
        timestamp     0.000000
        full_sq       0.000000
        life_sq       0.209478
        floor         0.005481
        max_floor     0.314135
        material      0.314135
        build_year    0.446490
        num_room      0.314135
        kitch_sq      0.314135
        dtype: float64
      • Method #2: missing data (by columns) heatmap
        # to visualise missing values
        # a heatmap can be created with the "seaborn" or "missingno" libraries
        import seaborn as sns
        
        # since the number of columns is very large, it is difficult to visualise them all at once;
        # we can look at the pattern of missing data for the first 30 columns
        cols = df.columns[:30]
        colors = ["green", "blue"]
        sns.heatmap(df[cols].isna(), cmap=sns.color_palette(colors))
        
        ![[Assets/DataVisualizationNotebook/output_19_1.png]]

        The column life_sq has missing values scattered across many rows, while the column max_floor has most of its values missing.

# missingno

        The missingno library is a small toolset focused on missing data visualizations and utilities, so you can produce the same missing data heatmap as above with shorter code.
        !pip install missingno
        import missingno as msno
        msno.matrix(df.iloc[:, :30])
        
        ![output_22_2.png](../../../Assets/DataCleaningNotebook/output_22_2.png)
      • Method #3: missing data (by rows) histogram
      • We can also summarize the missing data by rows.
        missing_by_row = df.isna().sum(axis='columns')
        missing_by_row.hist(bins=50)
        
        ![[Assets/DataVisualizationNotebook/output_24_1.png]]

        This histogram helps identify the missing-value patterns among the 30,471 observations. For example, there are over 6,000 observations with no missing values, and close to 4,000 observations with 1 missing value.
          How to handle missing data?
        • Methods:
        • Technique #1: drop columns / features: drop entire columns or features with a large number of missing values. This causes a loss of information. Let's look at the columns with a high percentage of missing values.
          num_missing[num_missing > 0.3]
          
          max_floor                     0.314135
          material                      0.314135
          build_year                    0.446490
          num_room                      0.314135
          kitch_sq                      0.314135
          state                         0.444980
          hospital_beds_raion           0.473926
          cafe_sum_500_min_price_avg    0.435857
          cafe_sum_500_max_price_avg    0.435857
          cafe_avg_price_500            0.435857
          dtype: float64
          # note: drop() returns a new DataFrame; without assignment (or inplace=True), df itself is
          # unchanged, which is why df.head() below still shows all 292 columns
          df.drop(['max_floor', 'material', 'build_year', 'num_room', 'kitch_sq', 'state',
                   'hospital_beds_raion', 'cafe_sum_500_min_price_avg',
                   'cafe_sum_500_max_price_avg', 'cafe_avg_price_500'], axis=1)
          df.head()
          
          id timestamp full_sq life_sq floor max_floor material build_year num_room kitch_sq ... cafe_count_5000_price_2500 cafe_count_5000_price_4000 cafe_count_5000_price_high big_church_count_5000 church_count_5000 mosque_count_5000 leisure_count_5000 sport_count_5000 market_count_5000 price_doc
          0 1 2011-08-20 43 27.0 4.0 NaN NaN NaN NaN NaN ... 9 4 0 13 22 1 0 52 4 5850000
          1 2 2011-08-23 34 19.0 3.0 NaN NaN NaN NaN NaN ... 15 3 0 15 29 1 10 66 14 6000000
          2 3 2011-08-27 43 29.0 2.0 NaN NaN NaN NaN NaN ... 10 3 0 11 27 0 4 67 10 5700000
          3 4 2011-09-01 89 50.0 9.0 NaN NaN NaN NaN NaN ... 11 2 1 4 4 0 0 26 3 13100000
          4 5 2011-09-05 77 77.0 4.0 NaN NaN NaN NaN NaN ... 319 108 17 135 236 2 91 195 14 16331452

          5 rows × 292 columns

        • Technique #2: drop rows / observations
          # thresh specifies the minimum number of non-missing values a row must have in order not to be dropped
          # (only thresh is passed here; newer pandas versions do not allow combining `how` and `thresh`)
          df.dropna(axis=0, thresh=257, inplace=True)
          df
          
          id timestamp full_sq life_sq floor max_floor material build_year num_room kitch_sq ... cafe_count_5000_price_2500 cafe_count_5000_price_4000 cafe_count_5000_price_high big_church_count_5000 church_count_5000 mosque_count_5000 leisure_count_5000 sport_count_5000 market_count_5000 price_doc
          0 1 2011-08-20 43 27.0 4.0 NaN NaN NaN NaN NaN ... 9 4 0 13 22 1 0 52 4 5850000
          1 2 2011-08-23 34 19.0 3.0 NaN NaN NaN NaN NaN ... 15 3 0 15 29 1 10 66 14 6000000
          2 3 2011-08-27 43 29.0 2.0 NaN NaN NaN NaN NaN ... 10 3 0 11 27 0 4 67 10 5700000
          3 4 2011-09-01 89 50.0 9.0 NaN NaN NaN NaN NaN ... 11 2 1 4 4 0 0 26 3 13100000
          4 5 2011-09-05 77 77.0 4.0 NaN NaN NaN NaN NaN ... 319 108 17 135 236 2 91 195 14 16331452
          ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
          30466 30469 2015-06-30 44 27.0 7.0 9.0 1.0 1975.0 2.0 6.0 ... 15 5 0 15 26 1 2 84 6 7400000
          30467 30470 2015-06-30 86 59.0 3.0 9.0 2.0 1935.0 4.0 10.0 ... 313 128 24 98 182 1 82 171 15 25000000
          30468 30471 2015-06-30 45 NaN 10.0 20.0 1.0 NaN 1.0 1.0 ... 1 1 0 2 12 0 1 11 1 6970959
          30469 30472 2015-06-30 64 32.0 5.0 15.0 1.0 2003.0 2.0 11.0 ... 22 1 1 6 31 1 4 65 7 13500000
          30470 30473 2015-06-30 43 28.0 1.0 9.0 1.0 1968.0 2.0 6.0 ... 5 2 0 7 16 0 9 54 10 5600000

          29779 rows × 292 columns

        • Technique #3: impute the missing with constant values.
          df_copy = df.copy()
          numeric_cols = df_copy.select_dtypes(include=['number']).columns
          df_copy[numeric_cols] = df_copy[numeric_cols].fillna(-999)
          df_copy[non_numeric_cols] = df_copy[non_numeric_cols].fillna('_MISSING_')
          
        • Technique #4: impute the missing with statistics
          # imputing numeric columns by the median
          # (note: df_copy was already filled with constants in Technique #3, so these calls have no
          #  further effect here; in practice you would apply either Technique #3 or Technique #4)
          med = df_copy[numeric_cols].median()
          df_copy[numeric_cols] = df_copy[numeric_cols].fillna(med)
          
          # imputing non-numeric columns by the mode (most frequent value per column)
          most_freq = df_copy[non_numeric_cols].mode()
          # mode() returns a DataFrame (one row per modal value), so pass its first row to fillna
          df_copy[non_numeric_cols] = df_copy[non_numeric_cols].fillna(most_freq.iloc[0])
          
          most_freq
          
          timestamp product_type sub_area culture_objects_top_25 thermal_power_plant_raion incineration_raion oil_chemistry_raion radiation_raion railroad_terminal_raion big_market_raion nuclear_reactor_raion detention_facility_raion water_1line big_road1_1line railroad_1line ecology
          0 2014-12-16 Investment Poselenie Sosenskoe no no no no no no no no no no no no poor
          #to check for missing values
          df_copy.isna().sum()
          
          id                    0
          timestamp             0
          full_sq               0
          life_sq               0
          floor                 0
                               ..
          mosque_count_5000     0
          leisure_count_5000    0
          sport_count_5000      0
          market_count_5000     0
          price_doc             0
          Length: 292, dtype: int64

# Irregular data (outliers)

          Outliers can bias the results of our data analysis and provide a misleading picture of the data. Outliers may be genuine extreme values or mistakes.
            How to detect?
          • Method #1: descriptive statistics
            df_copy.describe()
            
            (Output: 8 rows × 279 columns. describe() reports count, mean, std, min, 25%, 50%, 75%, and max for every numeric column; a maximum far above the 75th percentile, as for full_sq, is a quick signal of potential outliers.)

          • Method #2: histogram & box plot
            df_copy['full_sq'].hist(bins=100)
            
            ![output_43_1.png](../../../Assets/DataCleaningNotebook/output_43_1.png)
            sns.boxplot(df_copy['full_sq'])
            
            ![output_44_1.png](../../../Assets/DataCleaningNotebook/output_44_1.png)
          • Method #3: bar chart
            # to check outliers in categorical attributes
            df['ecology'].value_counts().plot(kind='bar')
             # but a category with very few occurrences (e.g., a single value called 'extraordinary') could be considered an 'outlier'
            
            ![output_46_1.png](../../../Assets/DataCleaningNotebook/output_46_1.png)
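            To flag such rare categories programmatically, one option (a small sketch, not from the original notebook; the 0.1% cutoff is an arbitrary illustrative threshold) is to filter value_counts() by frequency:

            # list categorical values that appear in fewer than 0.1% of rows (hypothetical threshold)
            counts = df['ecology'].value_counts()
            rare = counts[counts / len(df) < 0.001]
            print(rare)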
          • How to handle outliers? Depending on whether they are mistakes or genuine extreme values, we can remove, cap/adjust, or keep them; a small sketch of capping follows below.
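            A minimal sketch (not part of the original notebook) of capping a numeric column with the interquartile-range rule; the 1.5 multiplier is the conventional choice, not something prescribed by this dataset:

            # cap full_sq at the conventional 1.5*IQR fences (winsorize-style clipping)
            q1, q3 = df['full_sq'].quantile([0.25, 0.75])
            iqr = q3 - q1
            lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
            df['full_sq_capped'] = df['full_sq'].clip(lower=lower, upper=upper)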
          • Unnecessary data
          • Unnecessary type #1: repetitive & uninformative. When an extremely high percentage of a column contains one repeated value, it carries little information; the loop below prints columns where a single value covers more than 99.9% of rows.
            num_rows = len(df_copy)
            
            for col in df_copy.columns:
                cnts = df_copy[col].value_counts(dropna=False)
                top_pct = (cnts/num_rows).iloc[0]
            
                if top_pct > 0.999:
                    print('{0}: {1:.2f}%'.format(col, top_pct*100))
                    print(cnts)
                    print()
            
          • Unnecessary type #2: irrelevant. If features are not related to the question we are trying to solve, they are irrelevant, and the drop() method can remove them, as in the sketch below.
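            A hypothetical example (which columns are truly irrelevant depends on the modelling question); the id column identifies rows but cannot help predict price_doc:

            # drop a column that carries no predictive information
            df_relevant = df.drop(columns=['id'])
            df_relevant.shape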
          • Unnecessary type #3: duplicates
          • A duplicate occurs when all of the columns' values in two or more observations are identical.
            df_copy[df_copy.duplicated()]
            
            id timestamp full_sq life_sq floor max_floor material build_year num_room kitch_sq ... cafe_count_5000_price_2500 cafe_count_5000_price_4000 cafe_count_5000_price_high big_church_count_5000 church_count_5000 mosque_count_5000 leisure_count_5000 sport_count_5000 market_count_5000 price_doc

            0 rows × 292 columns

            # We first drop id, and then check whether there are duplicated rows in the DataFrame
            
            df_copy[df_copy.drop(columns=['id']).duplicated()]
            
            id timestamp full_sq life_sq floor max_floor material build_year num_room kitch_sq ... cafe_count_5000_price_2500 cafe_count_5000_price_4000 cafe_count_5000_price_high big_church_count_5000 church_count_5000 mosque_count_5000 leisure_count_5000 sport_count_5000 market_count_5000 price_doc
            4328 4331 2012-10-22 61 -999.0 18.0 -999.0 -999.0 -999.0 -999.0 -999.0 ... 11 2 1 5 4 0 1 32 5 8248500
            6991 6994 2013-04-03 42 -999.0 2.0 -999.0 -999.0 -999.0 -999.0 -999.0 ... 3 2 0 2 16 1 0 20 4 3444000
            8059 8062 2013-05-22 68 -999.0 2.0 -999.0 -999.0 -999.0 -999.0 -999.0 ... 3 2 0 2 16 1 0 20 4 5406690
            8653 8656 2013-06-24 40 -999.0 12.0 -999.0 -999.0 -999.0 -999.0 -999.0 ... 1 0 0 4 6 0 0 4 1 4112800
            14004 14007 2014-01-22 46 28.0 1.0 9.0 1.0 1968.0 2.0 5.0 ... 10 1 0 13 15 1 1 61 4 3000000
            17404 17407 2014-04-15 134 134.0 1.0 1.0 1.0 0.0 3.0 0.0 ... 0 0 0 0 1 0 0 0 0 5798496
            26675 26678 2014-12-17 62 -999.0 9.0 17.0 1.0 -999.0 2.0 1.0 ... 371 141 26 150 249 2 105 203 13 6552000
            28361 28364 2015-03-14 62 -999.0 2.0 17.0 1.0 -999.0 2.0 1.0 ... 371 141 26 150 249 2 105 203 13 6520500
            28712 28715 2015-03-30 41 41.0 11.0 17.0 1.0 2016.0 1.0 41.0 ... 2 2 0 2 9 0 0 7 2 4114580

            9 rows × 292 columns

            df_dedupped = df.drop(columns=['id']).drop_duplicates()
            
            print(df.shape)
            print(df_dedupped.shape)
            
            (29779, 292)
            (29770, 291)

# Inconsistent data
          • Inconsistent type #1: capitalization Inconsistent use of upper and lower cases in categorical values is typical. We need to clean it since Python is case-sensitive.
            # to print full numpy array without ...
            #import numpy as np
            #import sys
            #np.set_printoptions(threshold=sys.maxsize)
            
            
            # to display the entire list without ...
            with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
                display(df_copy['sub_area'].value_counts(dropna=False))
            
             Poselenie Sosenskoe        1617
             Nekrasovka                 1611
             Poselenie Vnukovskoe       1290
             Poselenie Moskovskij        925
             Poselenie Voskresenskoe     713
             ...
             Arbat                        15
             Vostochnoe                    7
             Poselenie Marushkinskoe       6
             Molzhaninovskoe               3
             Poselenie Voronovskoe         2
             Name: sub_area, Length: 141, dtype: int64

             The full listing covers 141 districts. Two spellings such as 'Poselenie Sosenskoe' and 'pOseleNie sosenskeo' would be counted as different districts, so inconsistent capitalization needs to be cleaned.
          • How to handle? To avoid this, we can lowercase (or uppercase) all letters.
            df_copy['sub_area_lower'] = df_copy['sub_area'].str.lower()
            df_copy['sub_area_lower'].value_counts(dropna=False)
            
             poselenie sosenskoe        1617
             nekrasovka                 1611
             poselenie vnukovskoe       1290
             poselenie moskovskij        925
             poselenie voskresenskoe     713
                                        ...
             arbat                        15
             vostochnoe                    7
             poselenie marushkinskoe       6
             molzhaninovskoe               3
             poselenie voronovskoe         2
             Name: sub_area_lower, Length: 141, dtype: int64
              Inconsistent type #2: typos in categorical values. A categorical column takes on a limited and usually fixed number of possible values, but sometimes other values show up for reasons such as typos.
                How to find out? Let’s see an example. Within the code below:
              • We generate a new DataFrame, df_city_ex, with a single column that stores city names. It contains misspellings: for example, 'torontooo' and 'turonto' both refer to the city of 'toronto'.
              • The variable cities stores the 4 correct names of ‘toronto’, ‘vancouver’, ‘montreal’, and ‘calgary’.
              • To identify typos, we use fuzzy matching: edit_distance from nltk measures the number of operations (e.g., substitution, insertion, deletion) needed to change one string into another.
              • We calculate the distance between the actual values and the correct values (see the edit_distance sketch after the value counts below); the notebook then uses the pyspellchecker library to correct the typos directly.
                # Levenshtein Distance, Hamming distance
                df_city_ex = pd.DataFrame(data={'city': ['torontooo', 'toronto', 'turonto', 'vancouver', 'vancover', 'vancouvr', 'montreal', 'calgary']})
                df_city_ex['city'].value_counts()
                
                vancouvr     1
                montreal     1
                vancover     1
                turonto      1
                calgary      1
                vancouver    1
                torontooo    1
                toronto      1
                Name: city, dtype: int64
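                A small sketch of the edit-distance approach described above; the `cities` list holds the four correct names, and printing the closest match is just for illustration:

                # compare each entry against the known-correct city names
                from nltk.metrics import edit_distance

                cities = ['toronto', 'vancouver', 'montreal', 'calgary']
                for city in df_city_ex['city']:
                    # distance to the closest correct name; a small distance suggests a typo of that name
                    dists = {c: edit_distance(city, c) for c in cities}
                    closest = min(dists, key=dists.get)
                    print(city, '->', closest, '(distance:', dists[closest], ')')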
                #!pip install pyspellchecker
                
                from spellchecker import SpellChecker
                spell = SpellChecker()
                i=0
                for city in df_city_ex['city']:
                    # Get the one `most likely` answer
                    df_city_ex.at[i, 'city']= spell.correction(city)
                    i=i+1
                    #print(spell.correction(city))
                df_city_ex['city']
                
                0      toronto
                1      toronto
                2      toronto
                3    vancouver
                4    vancouver
                5    vancouver
                6     montreal
                7      calgary
                Name: city, dtype: object
                df_city_ex['city'].value_counts()
                
                vancouver    3
                toronto      3
                montreal     1
                calgary      1
                Name: city, dtype: int64

---
Tags: [!AMLIndex](./!AMLIndex.md)