Pandas: if an error occurs

If I have a DataFrame:

myDF = DataFrame(data=[[11,11],[22,'2A'],[33,33]], columns = ['A','B'])

Gives the following dataframe (Starting out on stackoverflow and don’t have enough reputation for an image of the DataFrame)

   | A  | B  |

0  | 11 | 11 |

1  | 22 | 2A |

2  | 33 | 33 |

If I want to convert column B to int values and drop values that can’t be converted, I have to do:

def convertToInt(cell):
    try:
        return int(cell)
    except ValueError:
        return None
myDF['B'] = myDF['B'].apply(convertToInt)

If I only do:

myDF['B'].apply(int)

the error obviously is:

in pandas.lib.map_infer (pandas\lib.c:42840)()

ValueError: invalid literal for int() with base 10: '2A'

Is there a way to add exception handling to myDF['B'].apply()?

Thank you in advance!

asked Apr 3, 2014 at 19:36

I had the same question, but for a more general case where it was hard to tell if the function would generate an exception (i.e. you couldn’t explicitly check this condition with something as straightforward as isdigit).

After thinking about it for a while, I came up with the solution of embedding the try/except syntax in a separate function. I’m posting a toy example in case it helps anyone.

import pandas as pd
import numpy as np

x=pd.DataFrame(np.array([['a','a'], [1,2]]))

def augment(x):
    try:
        return int(x)+1
    except Exception:
        return 'error:' + str(x)

x[0].apply(lambda x: augment(x))

answered Feb 22, 2017 at 15:39

A way to achieve that with lambda:

myDF['B'].apply(lambda x: int(x) if str(x).isdigit() else None)

For your input:

>>> myDF
    A   B
0  11  11
1  22  2A
2  33  33

[3 rows x 2 columns]

>>> myDF['B'].apply(lambda x: int(x) if str(x).isdigit() else None)
0    11
1   NaN
2    33
Name: B, dtype: float64
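
One caveat with the isdigit check: it rejects strings such as '-5' or ' 7 ' that int() itself would accept. Here is a minimal sketch comparing the two approaches; the helper name to_int_or_none and the sample values are made up for illustration.

```python
import pandas as pd

def to_int_or_none(cell):
    """Return int(cell), or None when the value cannot be converted."""
    try:
        return int(cell)
    except (ValueError, TypeError):
        return None

s = pd.Series([11, '2A', '-5', ' 7 '], dtype=object)

# isdigit-based check: only accepts strings made purely of digits
via_isdigit = [int(v) if str(v).isdigit() else None for v in s]

# try/except-based check: accepts anything int() itself accepts
via_try = [to_int_or_none(v) for v in s]
```

The try/except version keeps the negative and the padded value, while the isdigit version silently turns them into None.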

answered Apr 3, 2014 at 19:54

much better/faster to do:

In [1]: myDF = DataFrame(data=[[11,11],[22,'2A'],[33,33]], columns = ['A','B'])

In [2]: myDF.convert_objects(convert_numeric=True)
    A   B
0  11  11
1  22 NaN
2  33  33

[3 rows x 2 columns]

In [3]: myDF.convert_objects(convert_numeric=True).dtypes
A      int64
B    float64
dtype: object

This is a vectorized way of doing just this. The convert_numeric=True flag says to mark as NaN anything that cannot be converted to a number.

You can of course do this to a single column if you’d like.
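
Note that convert_objects was later deprecated and removed from pandas. On current versions the same vectorized coercion is usually written with pd.to_numeric and errors='coerce'. A short sketch on the question's data (the variable name cleaned is only illustrative):

```python
import pandas as pd

myDF = pd.DataFrame(data=[[11, 11], [22, '2A'], [33, 33]], columns=['A', 'B'])

# errors='coerce' turns anything that cannot be parsed into NaN
myDF['B'] = pd.to_numeric(myDF['B'], errors='coerce')

# drop the rows that failed to convert, then cast back to integers
cleaned = myDF.dropna(subset=['B']).astype({'B': 'int64'})
```

Because NaN is a float, column B is float64 after the coercion; the cast back to int64 only works once the NaN rows are gone.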

answered Apr 3, 2014 at 20:20

error messages

  • it’s always better to be more specific about the cause of an error:
x = -1
if not isinstance(x,str): ## check if x is a str
    errstr = "x is of type "+type(x).__name__+", should be str"
    raise TypeError(errstr)
TypeError: x is of type int, should be str

f-strings are a convenient way to construct error messages: anything inside curly brackets is interpreted as a Python expression. e.g. 

print(f"x is of type {type(x).__name__}, should be str")
## x is of type int, should be str

So we could use

if not isinstance(x,str): ## check if x is a str
    raise TypeError(f"x is of type {type(x).__name__}, should be str")
x = -1
if x<0:
    raise ValueError(f"x should be non-negative, but it equals {x}")
ValueError: x should be non-negative, but it equals -1

handling errors

Now suppose you are getting an error and you don’t want your program to stop. “Wrapping” your code in a try: clause will allow you to specify what to do in this case. pass is a special Python statement called a “null operation” or a “no-op”; it does nothing except keep going.

try:
    x = math.sqrt(-1)
except:
    pass
## keep going (but x will not be set)

You can specify something you want to do with only a particular set of errors:

try:
    x = math.sqrt(-1)
except ValueError: 
    print("a ValueError occurred")
except:
    print("some other error occurred")
## keep going (but x will not be set)
## a ValueError occurred

If the error isn’t caught because it isn’t the right type, it will act like it normally does (without the try:)

try:
    z += 5  ## not defined yet
except ValueError: 
    print("a ValueError occurred")
NameError: name 'z' is not defined

We could catch this with a general-purpose except:

try:
    z += 5  ## not defined yet
except ValueError: 
    print("a ValueError occurred")
except:
    print("some other error occurred")
## some other error occurred

Or add another clause to catch it:

try:
    z += 5  ## not defined yet
except ValueError: 
    print("a ValueError occurred")
except NameError:
    print("a NameError occurred")
except:
    print("some other error occurred")
## a NameError occurred
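
The same pattern can be wrapped in a function so that each error type gets its own response. This is only a sketch; the name safe_sqrt and the choice to return NaN are assumptions made for illustration.

```python
import math

def safe_sqrt(value):
    """Return the square root of value, or NaN if it cannot be computed."""
    try:
        return math.sqrt(value)
    except ValueError as err:      # e.g. a negative number
        print(f"ValueError: {err}")
        return math.nan
    except TypeError as err:       # e.g. a string was passed
        print(f"TypeError: {err}")
        return math.nan

results = [safe_sqrt(v) for v in [4, -1, "abc"]]
```

Using `as err` keeps the original message available, so the program keeps going without hiding what went wrong.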

Data frames

  • rectangular data structure, looks a lot like an array.
  • each column is a Series; each column can be of a different type
  • rows and columns act differently
  • can index by (column) labels as well as positions
  • handles missing data (NaN)
  • convenient plotting
  • fast operations with keys
  • lots of facilities for input/output
import pandas as pd  ## standard abbreviation
# The initial set of baby names and birth rates
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
## initialize DataFrame with a *dictionary*
p = pd.DataFrame({'Name': names, 'Count': births})
##       Name  Count
## 0      Bob    968
## 1  Jessica    155
## 2     Mary     77
## 3     John    578
## 4      Mel    973

What can we do with it?

  • “Simple” indexing
    • Indexing (a single value) selects a column by its key
    • key could be a number, if column names weren’t given when setting up the data frame
    • Slicing selects rows by number
    • indexing with a list gives multiple columns
    • .iloc gives row/column indices (like an array)
p["Count"]  ## extract a column = Series (by *name*)
p[2:3]      ## slice one row (3-2 = 1)
p[2:5]      ## slice multiple rows
p[["Name","Count"]]    ## extract multiple columns (data frame)
p.iloc[1,1]     ## index with row/column integers like an array
p.iloc[0:5,:]   ## can also slice

Indexing by name

p["Name"][4]  ## 5th element of Name
p.Name  ## attribute!
p.loc[1:2,"Name"]  ## index by *label*, _inclusive_
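
The difference between label slices (.loc, endpoint included) and position slices (.iloc, endpoint excluded) is easy to check on the baby-names frame. A small sketch; the variable names by_label and by_position are illustrative.

```python
import pandas as pd

p = pd.DataFrame({'Name': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                  'Count': [968, 155, 77, 578, 973]})

by_label = p.loc[1:2, "Name"]     ## labels 1 and 2: endpoint included
by_position = p.iloc[1:2, 0]      ## position 1 only: endpoint excluded
```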

Measles data

Download US measles data from Project Tycho.

  • read_csv reads a CSV file as a data frame; it automatically interprets the first row as headings
  • df.iloc[] indexes the result as though it were an array
  • df.head() shows just the beginning; df.tail() shows just the end

Let’s look at the first few rows of a data set on measles in US states:

## "Weekly Measles Cases, 1909-2001"
## ...
## "Data provided by Project Tycho, Data Version 1.0.0, released 28 Novem...
## 1909,1,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
## 1909,2,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
## 1909,3,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
fn = "../data/MEASLES_Cases_1909-2001_20150322001618.csv"
p  = pd.read_csv(fn,skiprows=2,na_values=["-"])  ## read in data
p.head()                     ## look at the first little bit
## 0  1909     1      NaN     NaN  ...            NaN        NaN      NaN          NaN
## 1  1909     2      NaN     NaN  ...            NaN        NaN      NaN          NaN
## 2  1909     3      NaN     NaN  ...            NaN        NaN      NaN          NaN
## 3  1909     4      NaN     NaN  ...            NaN        NaN      NaN          NaN
## 4  1909     5      NaN     NaN  ...            NaN        NaN      NaN          NaN
## [5 rows x 62 columns]

Mostly NaN values at the beginning! (NaN = “not a number”: similar to nan from math or numpy)


  • Like numpy array indexing, but a little different …
  • Pandas doc, indexing and selecting
    • extract by name: df.loc[:,"MASSACHUSETTS":"NEVADA"] (index by label; includes endpoint)
    • extract by integer index: iloc method, df.iloc[:,range] (index by integer; doesn’t include endpoint)
p.loc[:,"MASSACHUSETTS":"NEVADA"]
## 0               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 1               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 2               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 3               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4               NaN       NaN        NaN  ...      NaN       NaN     NaN
## ...             ...       ...        ...  ...      ...       ...     ...
## 4856            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4857            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4858            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4859            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4860            NaN       NaN        NaN  ...      NaN       NaN     NaN
## [4861 rows x 8 columns]

This is the same:

pc = list(p.columns) ## list of column names
## find the locations of these two state names
mass_ind = pc.index("MASSACHUSETTS")
neva_ind = pc.index("NEVADA")
## index using `.iloc` (with extended range)
p.iloc[:,mass_ind:neva_ind+1]
## 0               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 1               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 2               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 3               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4               NaN       NaN        NaN  ...      NaN       NaN     NaN
## ...             ...       ...        ...  ...      ...       ...     ...
## 4856            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4857            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4858            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4859            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4860            NaN       NaN        NaN  ...      NaN       NaN     NaN
## [4861 rows x 8 columns]

More examples

You can also refer to individual columns as attributes (i.e. just p.<name>)

## 0   NaN
## 1   NaN
## 2   NaN
## 3   NaN
## 4   NaN
## Name: ARIZONA, dtype: float64
## 0   NaN
## 1   NaN
## 2   NaN
## 3   NaN
## 4   NaN
## Name: ARIZONA, dtype: float64

.drop() gets rid of elements

pp = p.drop(["YEAR","WEEK"],axis=1)
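## equivalent to (given that YEAR and WEEK are the first two columns)
pp2 = p.iloc[:,2:]                ## by position: everything from column 2 on
pp3 = p.loc[:,p.columns[2]:]      ## by label: from the first state column on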

Always use name-indexing whenever you can!

.index is a special attribute of data frames that governs searching, plotting, etc. Here we’ll set it to a decimal date value:

pp.index = p.YEAR+(p.WEEK-1)/52


Choosing specific rows of a data frame; &, | ,~ correspond to and, or, not (individual elements must be in parentheses)

ariz = p.ARIZONA                                ## pull out a column (attribute)
ariz[(p.YEAR==1970) & (ariz>50)]                ## *must* use parentheses!
## 3196    69.0
## 3197    57.0
## 3198    62.0
## 3200    56.0
## 3203    73.0
## 3205    54.0
## 3209    55.0
## Name: ARIZONA, dtype: float64
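
A toy version of the same selection shows all three operators. The numbers below are made up, not the real measles counts.

```python
import pandas as pd

## toy data, not the real measles counts
toy = pd.DataFrame({"YEAR": [1969, 1970, 1970, 1970],
                    "ARIZONA": [80.0, 69.0, 12.0, 57.0]})

both = toy[(toy.YEAR == 1970) & (toy.ARIZONA > 50)]      ## and
either = toy[(toy.YEAR == 1969) | (toy.ARIZONA < 20)]    ## or
negated = toy[~(toy.YEAR == 1970)]                       ## not
```

Without the parentheses, & and | bind more tightly than == and >, so the expression would be parsed differently and typically raises an error.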

Basic plotting

pandas will automatically plot data frames in a (reasonably) sensible way

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
## pp.plot()
pp.plot(legend=False,logy=True)                 ## plot method (non-Pythonic)

Or we can create our own (less complex) plots

import numpy as np
fig = plt.figure()
ax = fig.add_subplot(1,1,1)

Column and row manipulations

  • totals by week
ptot = pp.sum(axis=1)
  • df.min, df.max, df.mean all work too …


ptotweek = ptot.groupby(p.WEEK)
ptotweekmean = ptotweek.aggregate(np.mean)
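
The two steps (row totals, then a mean for each week) can be checked by hand on a tiny frame. The column names A and B and the values are invented for this sketch.

```python
import pandas as pd

toy = pd.DataFrame({"WEEK": [1, 2, 1, 2],
                    "A": [10, 20, 30, 40],
                    "B": [1, 2, 3, 4]})

totals = toy[["A", "B"]].sum(axis=1)           ## row totals: 11, 22, 33, 44
weekly_mean = totals.groupby(toy.WEEK).mean()  ## mean total for each week
```

Week 1 averages 11 and 33, week 2 averages 22 and 44.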

Dates and times


  • (Another) complex subject.
  • Lots of possible date formats
  • Basic idea: something like %Y-%m-%d; separators just match whatever’s in your data (usually “/” or “-”). Results need to be unambiguous, and ambiguity is dangerous (how is day of month specified? lower case, capital? etc.)
  • pandas tries to guess, but you shouldn’t let it.
## 2004-05-01 00:00:00
## 2004-05-01 00:00:00
  • Time zones and daylight savings time can be a nightmare
  • May need to have the right number of digits, especially in the absence of separators:
import pandas as pd
## 2004-12-01 00:00:00
## 2004-12-01 00:00:00
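
A short sketch of why the format should be given explicitly: the same string is a different date under a different format. The example strings are chosen for illustration.

```python
import pandas as pd

## the same string means different dates under different formats
as_day_first = pd.to_datetime("01/05/2004", format="%d/%m/%Y")
as_month_first = pd.to_datetime("01/05/2004", format="%m/%d/%Y")

## without separators the number of digits matters
no_separators = pd.to_datetime("20041201", format="%Y%m%d")
```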

For our measles data we have week of year, so things get a little complicated

yearstr = p.YEAR.apply(format)
weekstr = p.WEEK.apply(format,args=["02"])
datestr = p.YEAR.astype(str)+"-"+weekstr+"-0"
dateindex = pd.to_datetime(datestr,format="%Y-%U-%w")
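
The week-of-year conversion can be tried on two rows. This sketch pads the week with str.zfill rather than format, purely as a matter of taste; %U counts weeks starting on Sunday, and the trailing "-0" picks Sunday as the weekday.

```python
import pandas as pd

toy = pd.DataFrame({"YEAR": [2004, 2004], "WEEK": [1, 10]})

## zero-pad the week to two digits, then append "-0" (Sunday) as the weekday
weekstr = toy.WEEK.astype(str).str.zfill(2)
datestr = toy.YEAR.astype(str) + "-" + weekstr + "-0"
dates = pd.to_datetime(datestr, format="%Y-%U-%w")
```

Week 1 of 2004 maps to the first Sunday of the year, 4 January, because the days before the first Sunday belong to week 0.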

Binning results

  • turn a quantitative variable into categories
  • pd.cut(x,bins=...); decide on bins
  • pd.qcut(x,n); decide on number of bins (equal occupancy)
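
A small sketch of the two binning functions side by side. The values, bin edges, and labels are arbitrary choices for illustration.

```python
import pandas as pd

x = pd.Series([1, 5, 12, 18, 25, 40])

## pd.cut: we choose the bin edges, so bins may hold unequal counts
by_edges = pd.cut(x, bins=[0, 10, 20, 50], labels=["low", "mid", "high"])

## pd.qcut: we choose the number of bins, pandas picks edges for equal counts
by_quantile = pd.qcut(x, 3, labels=["low", "mid", "high"])
```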

Weather data

## fancy stuff: automatically look for index and convert it to a date/time
p = pd.read_csv("../data/eng2.csv",skiprows=14,encoding="latin1",index_col="Date/Time",parse_dates=True)
## rename columns
p.columns = [
    'Year', 'Month', 'Day', 'Time', 'Data Quality', 'Temp (C)', 
    'Temp Flag', 'Dew Point Temp (C)', 'Dew Point Temp Flag', 
    'Rel Hum (%)', 'Rel Hum Flag', 'Wind Dir (10s deg)', 'Wind Dir Flag', 
    'Wind Spd (km/h)', 'Wind Spd Flag', 'Visibility (km)', 'Visibility Flag',
    'Stn Press (kPa)', 'Stn Press Flag', 'Hmdx', 'Hmdx Flag', 'Wind Chill', 
    'Wind Chill Flag', 'Weather']
## drop columns that are *all* NA
p = p.dropna(axis=1,how='all')
p["Temp (C)"].plot()
## get rid of columns (axis=1) we don't want
p = p.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)

Now pull out the temperature and take the median by hour:

temp = p[['Temp (C)']]
temp["Hour"] = temp.index.hour
## <string>:1: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## See the caveats in the documentation:
temphr = temp.groupby('Hour')
medtmp = temphr.aggregate(np.median)
maxtmp = temphr.aggregate(np.max)
mintmp = temphr.aggregate(np.min)
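
One way to avoid the SettingWithCopyWarning shown above is to take an explicit copy before adding the Hour column. The sketch below uses an invented 48-hour temperature series in place of the weather file.

```python
import numpy as np
import pandas as pd

## toy hourly series standing in for the weather data
idx = pd.Timestamp("2012-01-01") + pd.to_timedelta(np.arange(48), unit="h")
p = pd.DataFrame({"Temp (C)": np.arange(48, dtype=float)}, index=idx)

temp = p[["Temp (C)"]].copy()       ## explicit copy, so adding a column is safe
temp["Hour"] = temp.index.hour
medtmp = temp.groupby("Hour")["Temp (C)"].median()
```

The original frame p is left untouched, and each hour's median is taken over the two days in the toy data.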

Now plot these …


When using the pandas library, it is very convenient to load data from various sources, for example from files with the read_csv function. Everything works out of the box, and there are many options. But if an error has crept into the data to be loaded, things get difficult. For analysis you can, of course, sacrifice part of the data at load time by calling the function with error_bad_lines=False. Then every line containing an error is ignored. This approach suits quick analysis, or cases where the number of bad lines is negligible compared with the size of the data. For accurate analysis, however, you need to load all of the data, which means processing the bad lines and getting them into the DataFrame.
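
On newer pandas versions (1.3 and later) error_bad_lines is deprecated in favour of on_bad_lines, and it was removed in pandas 2.0. A minimal sketch of skipping bad lines with the newer parameter, using a small in-memory CSV invented for the example:

```python
import io
import pandas as pd

# the second data row has three fields instead of two
raw = "a,b\n1,2\n3,4,5\n6,7\n"

# on_bad_lines="skip" drops rows with too many fields instead of raising
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
```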

One possible error looks like this: Error tokenizing data. C error: Expected 20 fields in line 47773, saw 22

which tells us that the number of columns in that line does not match the initial number that pandas took as its baseline (from the first line).

I really liked a solution that I found online.

The idea is to catch every error that pandas stumbles on as an Exception, handle it, and record the line where the failure occurred. Once pandas has worked through the whole file, all the bad lines are collected and processed however you like.

What is good about this implementation is that, first, as much work as possible is done by the built-in read_csv function; second, the parsing can be tuned to specific errors; third, you can plug in your own function for parsing the data.

The errors are handled head-on: the file is opened, only the bad lines are read and parsed by your own parser function, and everything is returned as an array. The two DataFrames are then concatenated.

Keep in mind that Python 3 has no convenient way to read a specific line from a file, so everything is done the simple way: open the file and read line by line with readline.

# python3 extended function to read csv in Pandas
import pandas as pd

def pandasReadCsvExtended(files, sep=',', func_parser='', func_afterparser=''):
    dataRez = pd.DataFrame()
    for file in files:
        line = []  # 0-based numbers of the lines pandas could not tokenize
        dataFile = pd.DataFrame()
        cont = True
        while cont == True:
            try:
                dataFile = pd.read_csv(file, sep=sep, skiprows=line)
                cont = False
            except Exception as e:
                errormsg = e.args[0]
                errortype = errormsg.split('.')[0].strip()
                if errortype == 'Error tokenizing data':
                    # "... Expected 20 fields in line 47773, saw 22" -> 47773
                    cerror = errormsg.split(':')[1].strip().replace(',', '')
                    nums = [n for n in cerror.split(' ') if str.isdigit(n)]
                    line.append(int(nums[1]) - 1)
                else:
                    print('Unknown Error: {}'.format(errormsg))
                    cont = False  # stop retrying; otherwise this loops forever
        if line != [] and callable(func_parser):
            fileIO = open(file, 'r')
            toDataFrame = []
            ln_prev = 0
            # count of columns in dataFrame
            header_count = dataFile.shape[1]
            for ln in line:
                lineSteps = ln - ln_prev
                tmp = ln_prev
                for i in range(lineSteps):
                    tmp += 1
                    fileIO.readline()  # presumably: skip lines before the bad one
                parts = func_parser(fileIO.readline().rstrip(), header_count)
                toDataFrame.append(parts)  # presumably: collect the parsed row
                ln_prev = tmp + 1
            fileIO.close()
            dataErrors = pd.DataFrame.from_records(toDataFrame, columns=dataFile.columns.values.tolist())
            dataFile = pd.concat([dataFile, dataErrors], ignore_index=True)
        if callable(func_afterparser):
            func_afterparser(dataFile, file)
        dataRez = pd.concat([dataRez, dataFile], ignore_index=True)
    return dataRez
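
For comparison, newer pandas versions (1.4 and later) let you pass your own repair function directly to read_csv through on_bad_lines, provided the Python parsing engine is used. This is a different technique from the function above, shown only as a sketch; the in-memory CSV and the repair rule keep_first_fields are hypothetical.

```python
import io
import pandas as pd

# the second data row has three fields instead of two
raw = "a,b\n1,2\n3,4,5\n6,7\n"

def keep_first_fields(bad_line):
    """Hypothetical repair rule: keep only as many fields as there are columns."""
    return bad_line[:2]

# a callable for on_bad_lines is only supported with engine="python"
df = pd.read_csv(io.StringIO(raw), engine="python", on_bad_lines=keep_first_fields)
```

The callable receives the bad line as a list of strings and returns the repaired list, or None to drop the line.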

Please bear in mind that I’m new to Pandas/Python and I don’t know what I’m doing.

I’m working with CSV files and I basically filter currencies.
Every other day, the exported CSV file may or may not contain certain currencies.

I have several code cells like this one:

AUDdf = df.loc[df['Currency'] == 'AUD']
AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
AUDdesc = AUDdf['Amount'].describe()

When the CSV doesn’t contain AUD, I get ValueError: cannot set a frame with no defined columns.
What I’d like to produce is a function or an if statement or a loop that checks if the column contains AUD, and if it does, it runs the above code, and if it doesn’t, it simply skips it and proceeds to the next line of code for the next currency.

Any idea how I can accomplish this?

Thanks in advance.

asked Mar 22, 2021 at 6:07

This can be done in 2 ways:

  1. You can create a try and except statement. This will try to process the given currency, and if a ValueError occurs it will skip it and move on:
try:
    AUDdf = df.loc[df['Currency'] == 'AUD']
    AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
    AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
    AUDdesc = AUDdf['Amount'].describe()
except ValueError:
    pass
  2. You can create an if statement which looks for the currency’s presence first:
currency_set = set(list(df['Currency'].values))

if 'AUD' in currency_set:
    AUDdf = df.loc[df['Currency'] == 'AUD']
    AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
    AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
    AUDdesc = AUDdf['Amount'].describe()
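
Since you have one such cell per currency, the check can be folded into a small function and a loop. This is only a sketch: the function name summarize, the results dictionary, and the toy data are made up, and only the column names come from your question.

```python
import pandas as pd

# toy export; the column names follow the question, the values are made up
df = pd.DataFrame({"Currency": ["USD", "USD", "EUR"],
                   "Username": ["ann", "bob", "ann"],
                   "Amount": [10.0, 5.0, 7.0]})

def summarize(df, currency):
    """Return (pivot table, describe) for one currency, or None if absent."""
    subset = df.loc[df["Currency"] == currency]
    if subset.empty:
        return None
    table = pd.pivot_table(subset, index=["Username"], values=["Amount"], aggfunc="sum")
    table.loc[f"{currency} Amounts Rejected Grand Total"] = subset["Amount"].sum()
    return table, subset["Amount"].describe()

results = {}
for currency in ["AUD", "USD", "EUR"]:
    summary = summarize(df, currency)
    if summary is not None:          # skip currencies missing from this export
        results[currency] = summary
```

Currencies that are missing from the day's export simply never make it into results, so nothing raises.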

answered Mar 22, 2021 at 6:19

1. Worst way to skip over the error/exception:

    try:
        <Your Code>
    except:
        pass

The above is probably the worst way, because you want to know when an exception occurs. Using generic except statements is bad practice because you want to avoid “catch 'em all” code. You want to catch exceptions that you know how to handle. You want to know which specific exception occurred, and you need to handle exceptions on a case-by-case basis. Writing generic except statements leads to missed bugs and tends to mislead you while testing the code.

2. Slightly better way to handle the exception:

    try:
       <Your Code>
    except Exception as e:
       <Some code to handle an exception>

    Still not optimal as it is still generic handling

3. Average way to handle it for your case:

    try:
       <Your Code>
    except ValueError:
       <Some code to handle this exception>

Other suggestions, which are much better ways to deal with this:

1. You can get a set of the available currencies at run time and aggregate based on whether 'AUD' is in it.

2. Clean your data set
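
Going one step further than the first suggestion, a groupby on the Currency column handles whatever currencies happen to be present in one pass. A sketch with made-up data; only the column names come from the question.

```python
import pandas as pd

df = pd.DataFrame({"Currency": ["USD", "USD", "EUR"],
                   "Username": ["ann", "bob", "ann"],
                   "Amount": [10.0, 5.0, 7.0]})

# one pass over whatever currencies are present; absent ones never appear
per_user = df.groupby(["Currency", "Username"])["Amount"].sum()
grand_totals = df.groupby("Currency")["Amount"].sum()
```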

answered Mar 22, 2021 at 6:29

You can use try and except, where:

try:
    #your code here
except:
    #some print statement 

answered Mar 22, 2021 at 6:17

Pandas is a powerful library for data analysis in Python. It offers a variety of functions to perform various operations on data. One of the most useful operations is the apply() function. It is used to apply a function to each row or column of a DataFrame. However, sometimes the data may contain errors or inconsistencies, which can cause the apply() function to fail. To handle such errors, we can use the try-except block within the apply() function.

Understanding the Try-Except Block

The try-except block is used to handle exceptions in Python. Exceptions are errors that occur during the execution of a program. When an exception occurs, the program stops executing and raises an error message. To prevent the program from crashing, we can use a try-except block to catch the exception and handle it gracefully.

The basic syntax of the try-except block is as follows:

try:
   # Code block that may raise an exception
except ExceptionType:
   # Code block to handle the exception

In the above syntax, we try to execute a code block that may raise an exception. If an exception occurs, the code block within the except block is executed to handle the exception.

Embedding Try-Except in Pandas Apply Operation

To embed the try-except block in Pandas apply() function, we can define a function that contains the try-except block and then pass this function to the apply() function. Here’s an example:

import pandas as pd

def my_function(x):
    try:
        # Code block that may raise an exception
    except Exception:
        # Code block to handle the exception
    return result

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df.apply(my_function)

In the above code, we define a function called my_function that contains the try-except block. We then pass this function to the apply() function of the DataFrame. The apply() function applies the function to each row or column of the DataFrame. If an exception occurs in any row or column, the try-except block within the function handles it gracefully.
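
To make the template concrete, here is a minimal runnable sketch. The function name add_as_int, the column contents, and the choice to return None on failure are all assumptions made for illustration.

```python
import pandas as pd

def add_as_int(row):
    """Add A to the integer value of B; return None if B is not a number."""
    try:
        return row["A"] + int(row["B"])
    except ValueError:
        return None

df = pd.DataFrame({"A": [1, 2, 3], "B": ["4", "five", "6"]})

# axis=1 passes each row to the function, so one bad row does not stop the rest
df["total"] = df.apply(add_as_int, axis=1)
```

The row containing "five" ends up with a missing value in the new column, and the other rows are computed normally.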


In conclusion, the try-except block in Pandas apply() function is a powerful tool to handle exceptions that occur during data analysis. By embedding the try-except block within the apply() function, we can handle errors and inconsistencies in the data, and prevent the program from crashing. By following the guidelines in this beginner’s guide, you can easily embed the try-except block in your Pandas code and take full advantage of the power of the apply() function.

