If I have a DataFrame:
import pandas as pd

myDF = pd.DataFrame(data=[[11,11],[22,'2A'],[33,33]], columns = ['A','B'])

This gives the following DataFrame (I'm starting out on Stack Overflow and don't have enough reputation to post an image of it):

    A   B
0  11  11
1  22  2A
2  33  33
If I want to convert column B to int values and drop the values that can't be converted, I have to do:
def convertToInt(cell):
    try:
        return int(cell)
    except:
        return None

myDF['B'] = myDF['B'].apply(convertToInt)
If I only do:
myDF['B'].apply(int)
the error obviously is:
C:\WinPython-32bit-2.7.5.3\python-2.7.5\lib\site-packages\pandas\lib.pyd in pandas.lib.map_infer (pandas\lib.c:42840)()
ValueError: invalid literal for int() with base 10: '2A'
Is there a way to add exception handling to myDF['B'].apply()?
Thank you in advance!
asked Apr 3, 2014 at 19:36 by RukTech
I had the same question, but for a more general case where it was hard to tell if the function would generate an exception (i.e. you couldn't explicitly check this condition with something as straightforward as isdigit).
After thinking about it for a while, I came up with the solution of embedding the try/except syntax in a separate function. I'm posting a toy example in case it helps anyone.
import pandas as pd
import numpy as np

x = pd.DataFrame(np.array([['a','a'], [1,2]]))

def augment(x):
    try:
        return int(x) + 1
    except:
        return 'error:' + str(x)

x[0].apply(lambda x: augment(x))
answered Feb 22, 2017 at 15:39 by atkat12
A way to achieve that with a lambda:
myDF['B'].apply(lambda x: int(x) if str(x).isdigit() else None)
For your input:
>>> myDF
A B
0 11 11
1 22 2A
2 33 33
[3 rows x 2 columns]
>>> myDF['B'].apply(lambda x: int(x) if str(x).isdigit() else None)
0 11
1 NaN
2 33
Name: B, dtype: float64
answered Apr 3, 2014 at 19:54 by Amit
much better/faster to do:
In [1]: myDF = DataFrame(data=[[11,11],[22,'2A'],[33,33]], columns = ['A','B'])
In [2]: myDF.convert_objects(convert_numeric=True)
Out[2]:
A B
0 11 11
1 22 NaN
2 33 33
[3 rows x 2 columns]
In [3]: myDF.convert_objects(convert_numeric=True).dtypes
Out[3]:
A int64
B float64
dtype: object
This is a vectorized method of doing just this. The convert_numeric=True flag says to coerce to NaN anything that cannot be converted to numeric.
You can of course do this to a single column if you'd like.
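Note that convert_objects was deprecated and later removed in newer pandas; on a current version, pd.to_numeric with errors="coerce" is the vectorized equivalent. A minimal sketch of the same conversion:

import pandas as pd

myDF = pd.DataFrame(data=[[11,11],[22,'2A'],[33,33]], columns=['A','B'])
myDF['B'] = pd.to_numeric(myDF['B'], errors='coerce')  # anything unparseable becomes NaN
myDF.dtypes  # A int64, B float64 (float because of the NaN)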
answered Apr 3, 2014 at 20:20 by Jeff
Error messages
- It's always better to be more specific about the cause of an error:
x = -1
if not isinstance(x,str): ## check if x is a str
    errstr = "x is of type " + type(x).__name__ + ", should be str"
    raise TypeError(errstr)
TypeError: x is of type int, should be str
f-strings are a convenient way to construct error messages: anything inside curly brackets is interpreted as a Python expression. e.g.
x=1
print(f"x is of type {type(x).__name__}, should be str")
## x is of type int, should be str
So we could use
if not isinstance(x,str): ## check if x is a str
    raise TypeError(f"x is of type {type(x).__name__}, should be str")

x = -1
if x<0:
    raise ValueError(f"x should be non-negative, but it equals {x}")
ValueError: x should be non-negative, but it equals -1
Handling errors
Now suppose you are getting an error and you don't want your program to stop. "Wrapping" your code in a try: clause will allow you to specify what to do in this case. pass is a special Python statement called a "null operation" or a "no-op"; it does nothing except allow execution to keep going.
import math

try:
    x = math.sqrt(-1)
except:
    pass
## keep going (but x will not be set)
You can specify something you want to do with only a particular set of errors:
try:
    x = math.sqrt(-1)
except ValueError:
    print("a ValueError occurred")
except:
    print("some other error occurred")
## keep going (but x will not be set)
## a ValueError occurred
If the error isn't caught because it isn't the right type, it will act as it normally would (without the try:):
try:
    z += 5 ## not defined yet
except ValueError:
    print("a ValueError occurred")
NameError: name 'z' is not defined
We could catch this with a general-purpose except:
try:
    z += 5 ## not defined yet
except ValueError:
    print("a ValueError occurred")
except:
    print("some other error occurred")
## some other error occurred
Or add another clause to catch it:
try:
    z += 5 ## not defined yet
except ValueError:
    print("a ValueError occurred")
except NameError:
    print("a NameError occurred")
except:
    print("some other error occurred")
## a NameError occurred
Data frames
- rectangular data structure, looks a lot like an array.
- each column is a Series; each column can be of a different type
- rows and columns act differently
- can index by (column) labels as well as positions
- handles missing data (NaN)
- convenient plotting
- fast operations with keys
- lots of facilities for input/output
import pandas as pd ## standard abbreviation
# The initial set of baby names and birth rates
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
## initialize DataFrame with a *dictionary*
p = pd.DataFrame({'Name': names, 'Count': births})
print(p)
## Name Count
## 0 Bob 968
## 1 Jessica 155
## 2 Mary 77
## 3 John 578
## 4 Mel 973
What can we do with it?
- “Simple” indexing
- Indexing (a single value) selects a column by its key
- key could be a number, if column names weren’t given when setting up the data frame
- Slicing selects rows by number
- indexing with a list gives multiple columns
- .iloc gives row/column indices (like an array)
p["Count"] ## extract a column = Series (by *name*)
p[2:3] ## slice one row (3-2 = 1)
p[2:5] ## slice multiple rows
p[["Name","Count"]] ## extract multiple columns (data frame)
p.iloc[1,1] ## index with row/column integers like an array
p.iloc[0:5,:] ## can also slice
Indexing by name
p["Name"][4] ## 5th element of Name
p.Name ## attribute!
p.loc[1:2,"Name"] ## index by *label*, _inclusive_
Measles data
Download US measles data from Project Tycho.
- read_csv reads a CSV file as a data frame; it automatically interprets the first row as headings
- df.iloc[] indexes the result as though it were an array
- df.head() shows just the beginning; df.tail() shows just the end
Let’s look at the first few rows of a data set on measles in US states:
## "Weekly Measles Cases, 1909-2001"
## ...
## "Data provided by Project Tycho, Data Version 1.0.0, released 28 Novem...
## "YEAR","WEEK","ALABAMA","ALASKA","AMERICAN SAMOA","ARIZONA","ARKANSAS"...
## 1909,1,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
## 1909,2,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
## 1909,3,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
fn = "../data/MEASLES_Cases_1909-2001_20150322001618.csv"
p = pd.read_csv(fn,skiprows=2,na_values=["-"]) ## read in data
p.head() ## look at the first little bit
## YEAR WEEK ALABAMA ALASKA ... WEST VIRGINIA WISCONSIN WYOMING Unnamed: 61
## 0 1909 1 NaN NaN ... NaN NaN NaN NaN
## 1 1909 2 NaN NaN ... NaN NaN NaN NaN
## 2 1909 3 NaN NaN ... NaN NaN NaN NaN
## 3 1909 4 NaN NaN ... NaN NaN NaN NaN
## 4 1909 5 NaN NaN ... NaN NaN NaN NaN
##
## [5 rows x 62 columns]
Mostly NaN values at the beginning! (NaN = "not a number": similar to nan from math or numpy)
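A quick sketch of how these missing values behave (NaN never compares equal, even to itself, so use pd.isna to test for it):

import numpy as np
import pandas as pd

x = float("nan")        ## the same kind of value as np.nan or math.nan
x == np.nan             ## False: NaN is not equal to anything, including itself
pd.isna(x)              ## True: the reliable way to test for a missing value
pd.Series([1.0, np.nan, 3.0]).isna().sum()  ## count missing entries: 1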
Selecting
- Like numpy array indexing, but a little different … (Pandas doc: indexing and selecting)
- extract by name: df.loc[:,"MASSACHUSETTS":"NEVADA"] (index by label; includes the endpoint)
- extract by integer index: the iloc method, df.iloc[:,range] (index by integer; doesn't include the endpoint)
Extract by name:
p.loc[:,"MASSACHUSETTS":"NEVADA"]
## MASSACHUSETTS MICHIGAN MINNESOTA ... MONTANA NEBRASKA NEVADA
## 0 NaN NaN NaN ... NaN NaN NaN
## 1 NaN NaN NaN ... NaN NaN NaN
## 2 NaN NaN NaN ... NaN NaN NaN
## 3 NaN NaN NaN ... NaN NaN NaN
## 4 NaN NaN NaN ... NaN NaN NaN
## ... ... ... ... ... ... ... ...
## 4856 NaN NaN NaN ... NaN NaN NaN
## 4857 NaN NaN NaN ... NaN NaN NaN
## 4858 NaN NaN NaN ... NaN NaN NaN
## 4859 NaN NaN NaN ... NaN NaN NaN
## 4860 NaN NaN NaN ... NaN NaN NaN
##
## [4861 rows x 8 columns]
This is the same:
pc = list(p.columns) ## list of column names
print(pc[:5])
## ['YEAR', 'WEEK', 'ALABAMA', 'ALASKA', 'AMERICAN SAMOA']
## find the locations of these two state names
mass_ind = pc.index("MASSACHUSETTS")
neva_ind = pc.index("NEVADA")
## index using `.iloc` (with the endpoint extended by one)
p.iloc[:,mass_ind:neva_ind+1]
## MASSACHUSETTS MICHIGAN MINNESOTA ... MONTANA NEBRASKA NEVADA
## 0 NaN NaN NaN ... NaN NaN NaN
## 1 NaN NaN NaN ... NaN NaN NaN
## 2 NaN NaN NaN ... NaN NaN NaN
## 3 NaN NaN NaN ... NaN NaN NaN
## 4 NaN NaN NaN ... NaN NaN NaN
## ... ... ... ... ... ... ... ...
## 4856 NaN NaN NaN ... NaN NaN NaN
## 4857 NaN NaN NaN ... NaN NaN NaN
## 4858 NaN NaN NaN ... NaN NaN NaN
## 4859 NaN NaN NaN ... NaN NaN NaN
## 4860 NaN NaN NaN ... NaN NaN NaN
##
## [4861 rows x 8 columns]
More examples
You can also refer to individual columns as attributes (i.e. just p.<name>):
p.ARIZONA[:5]
## 0 NaN
## 1 NaN
## 2 NaN
## 3 NaN
## 4 NaN
## Name: ARIZONA, dtype: float64
p.ARIZONA.head()
## 0 NaN
## 1 NaN
## 2 NaN
## 3 NaN
## 4 NaN
## Name: ARIZONA, dtype: float64
.drop() gets rid of elements:
pp = p.drop(["YEAR","WEEK"],axis=1)
## equivalent, by integer position (drop the first two columns):
pp2 = p.iloc[:,2:]
## a single column selected by label:
pp3 = p.loc[:,"ARIZONA"]
Always use name-indexing whenever you can!
.index is a special attribute of data frames that governs searching, plotting, etc. Here we'll set it to a decimal date value:
pp.index = p.YEAR+(p.WEEK-1)/52
Filtering
Choosing specific rows of a data frame: &, |, ~ correspond to and, or, not (individual conditions must be wrapped in parentheses)
ariz = p.ARIZONA ## pull out a column (attribute)
ariz[(p.YEAR==1970) & (ariz>50)] ## *must* use parentheses!
## 3196 69.0
## 3197 57.0
## 3198 62.0
## 3200 56.0
## 3203 73.0
## 3205 54.0
## 3209 55.0
## Name: ARIZONA, dtype: float64
Basic plotting
pandas will automatically plot data frames in a (reasonably) sensible way
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
## pp.plot()
pp.plot(legend=False,logy=True) ## plot method (non-Pythonic)
plt.savefig("pix/measles1.png")
Or we can create our own (less complex) plots
import numpy as np
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(pp.index,np.log10(pp.ARIZONA))
Column and row manipulations
- totals by week
ptot = pp.sum(axis=1)
df.min, df.max, df.mean all work too … (see the sketch below)
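For example (a small sketch using the pp and ptot objects defined above):

pp.mean(axis=1).head()   ## mean weekly count across states
pp.max(axis=1).head()    ## largest single-state count in each week
ptot.min()               ## smallest weekly total in the whole series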
Aggregation
ptotweek = ptot.groupby(p.WEEK)
ptotweekmean = ptotweek.aggregate(np.mean)
ptotweekmean.plot()
Dates and times
reference
- (Another) complex subject.
- Lots of possible date formats
- Basic idea: a format string like %Y-%m-%d; separators just match whatever's in your data (usually "/" or "-"). Results need to be unambiguous, and ambiguity is dangerous (how is the day of month specified? lower case or capital? etc.). pandas tries to guess, but you shouldn't let it.
print(pd.to_datetime("05-01-2004"))
## 2004-05-01 00:00:00
print(pd.to_datetime("05-01-2004",format="%m-%d-%Y"))
## 2004-05-01 00:00:00
- Time zones and daylight savings time can be a nightmare
- May need to have the right number of digits, especially in the absence of separators:
import pandas as pd
print(pd.to_datetime("1212004",format="%m%d%Y"))
## 2004-12-01 00:00:00
print(pd.to_datetime("12012004",format="%m%d%Y"))
## 2004-12-01 00:00:00
For our measles data we have week of year, so things get a little complicated
yearstr = p.YEAR.apply(format)
weekstr = p.WEEK.apply(format,args=["02"])
datestr = p.YEAR.astype(str)+"-"+weekstr+"-0"
dateindex = pd.to_datetime(datestr,format="%Y-%U-%w")
Binning results
- turn a quantitative variable into categories
- pd.cut(x, bins=...): you decide on the bins
- pd.qcut(x, n): you decide on the number of bins (equal occupancy); see the sketch below
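A minimal sketch of both, on made-up data (the variable names here are just for illustration):

import numpy as np
import pandas as pd

x = pd.Series(np.random.default_rng(0).normal(size=100))
equal_width = pd.cut(x, bins=4)    ## you choose the bins (or their number); equal-width edges
equal_occupancy = pd.qcut(x, 4)    ## you choose the number of bins; pandas picks edges for equal occupancy
equal_width.value_counts()
equal_occupancy.value_counts()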
Weather data
## fancy stuff: automatically look for index and convert it to a date/time
p = pd.read_csv("../data/eng2.csv",skiprows=14,encoding="latin1",index_col="Date/Time",parse_dates=True)
## rename columns
p.columns = [
'Year', 'Month', 'Day', 'Time', 'Data Quality', 'Temp (C)',
'Temp Flag', 'Dew Point Temp (C)', 'Dew Point Temp Flag',
'Rel Hum (%)', 'Rel Hum Flag', 'Wind Dir (10s deg)', 'Wind Dir Flag',
'Wind Spd (km/h)', 'Wind Spd Flag', 'Visibility (km)', 'Visibility Flag',
'Stn Press (kPa)', 'Stn Press Flag', 'Hmdx', 'Hmdx Flag', 'Wind Chill',
'Wind Chill Flag', 'Weather']
## drop columns that are *all* NA
p = p.dropna(axis=1,how='all')
p["Temp (C)"].plot()
## get rid of columns (axis=1) we don't want
p = p.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)
Now pull out the temperature and take the median by hour:
temp = p[['Temp (C)']]
temp["Hour"] = temp.index.hour
## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
temphr = temp.groupby('Hour')
medtmp = temphr.aggregate(np.median)
maxtmp = temphr.aggregate(np.max)
mintmp = temphr.aggregate(np.min)
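The SettingWithCopyWarning above appears because temp is a slice of p; one common way to avoid it (a sketch, using the same data as above) is to take an explicit copy before adding the new column:

temp = p[['Temp (C)']].copy()      ## explicit copy, so the assignment below is unambiguous
temp["Hour"] = temp.index.hour     ## no warning this time
medtmp = temp.groupby("Hour").median()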
Now plot these …
With the Pandas library it is very convenient to load data from different sources, for example from files using the read_csv function. Everything works out of the box, with plenty of options. But if an error has crept into the data you need to load, you're out of luck. Of course, for a rough analysis you can discard part of the data at load time, i.e. run the function with error_bad_lines=False; then every row containing an error is simply ignored. That approach is fine for a quick analysis, or when the number of rows with errors is negligible compared to the size of the data. But for an exact analysis you need to load all of the data, i.e. handle the faulty rows and get them into the DataFrame.
One of the possible errors looks like this:
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 20 fields in line 47773, saw 22
which tells us that the number of columns in that row does not match the number pandas took as its baseline (from the first row).
I really liked a solution I found on stackoverflow.com.
The idea is that every error pandas stumbles over is caught as an exception, handled, and the number of the row where the failure occurred is recorded. After pandas has worked through the whole file, all the rows with errors are collected and processed however you like.
What is good about this implementation is that, first, as much of the work as possible is done by the built-in read_csv function; second, the parsing can be tuned to specific errors; and third, you can plug in your own function for parsing the data.
The error handling is done head-on: the file is opened, only the rows with errors are read, they are parsed by your own parser function, and everything is returned as an array. Afterwards the two DataFrames are concatenated.
Keep in mind that Python 3 has no clean way to read a specific line number from a file, so it is all done simply by opening the file and reading lines with readline.
# python3 extended function to read csv in Pandas
import pandas as pd

def pandasReadCsvExtended(files, sep=',', func_parser='', func_afterparser=''):
    dataRez = pd.DataFrame()
    for file in files:
        line = []
        dataFile = pd.DataFrame()
        cont = True
        while cont == True:
            try:
                dataFile = pd.read_csv(file, sep=sep, skiprows=line)
                cont = False
            except Exception as e:
                errormsg = e.args[0]
                errortype = errormsg.split('.')[0].strip()
                if errortype == 'Error tokenizing data':
                    cerror = errormsg.split(':')[1].strip().replace(',', '')
                    nums = [n for n in cerror.split(' ') if str.isdigit(n)]
                    line.append(int(nums[1]) - 1)
                else:
                    print('Unknown Error: {}'.format(errormsg))
        if line != [] and callable(func_parser):
            fileIO = open(file, 'r')
            toDataFrame = []
            ln_prev = 0
            # count of columns in dataFrame
            header_count = dataFile.shape[1]
            for ln in line:
                lineSteps = ln - ln_prev
                tmp = ln_prev
                for i in range(lineSteps):
                    tmp += 1
                    fileIO.readline()
                parts = func_parser(fileIO.readline().rstrip(), header_count)
                toDataFrame.append(parts)
                ln_prev = tmp + 1
            dataErrors = pd.DataFrame.from_records(toDataFrame, columns=dataFile.columns.values.tolist())
            dataFile = pd.concat([dataFile, dataErrors], ignore_index=True)
        if callable(func_afterparser):
            func_afterparser(dataFile, file)
        dataRez = pd.concat([dataRez, dataFile], ignore_index=True)
    return dataRez
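A hypothetical usage sketch (the file name and the parser below are invented for illustration): func_parser receives the raw text of a bad line plus the expected column count and must return one record of that width.

def simple_parser(raw_line, n_columns):
    # naive parser: split on commas, then pad or trim to the expected number of columns
    parts = raw_line.split(',')
    return parts[:n_columns] + [''] * (n_columns - len(parts))

df = pandasReadCsvExtended(["data.csv"], sep=',', func_parser=simple_parser)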
Please bear with me, I'm new to Pandas/Python and I don't really know what I'm doing.
I’m working with CSV files and I basically filter currencies.
Every other day, the exported CSV file may contain or not contain certain currencies.
I have several such cells of code:
AUDdf = df.loc[df['Currency'] == 'AUD']
AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
AUDdesc = AUDdf['Amount'].describe()
When the CSV doesn’t contain AUD, I get ValueError: cannot set a frame with no defined columns.
What I’d like to produce is a function or an if statement or a loop that checks if the column contains AUD, and if it does, it runs the above code, and if it doesn’t, it simply skips it and proceeds to the next line of code for the next currency.
Any idea how I can accomplish this?
Thanks in advance.
asked Mar 22, 2021 at 6:07
This can be done in 2 ways:
- You can create a try and except statement, this will try and look for the given currency and if a ValueError occurs it will skip and move on:
try:
    AUDdf = df.loc[df['Currency'] == 'AUD']
    AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
    AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
    AUDdesc = AUDdf['Amount'].describe()
except ValueError:
    pass
- You can create an if statement which looks for the currencies presence first:
currency_set = set(df['Currency'].values)
if 'AUD' in currency_set:
    AUDdf = df.loc[df['Currency'] == 'AUD']
    AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
    AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
    AUDdesc = AUDdf['Amount'].describe()
answered Mar 22, 2021 at 6:19
1. Worst way to skip over the error/exception:
try:
    <Your Code>
except:
    pass
The above is probably the worst way, because you want to know when an exception occurs. Using generic except statements is bad practice because you want to avoid "catch 'em all" code. You want to be catching exceptions that you know how to handle. You want to know what specific exception occurred, and you need to handle exceptions on an exception-by-exception basis. Writing generic except statements leads to missed bugs and tends to mislead while running the code to test.
2. Slightly better way to handle the exception:
try:
    <Your Code>
except Exception as e:
    <Some code to handle an exception>
Still not optimal, as it is still generic handling.
3. Average way to handle it for your case:
try:
    <Your Code>
except ValueError:
    <Some code to handle this exception>
Other suggestion: much better ways to deal with this:
1. You can get a set of the available currencies at run time and aggregate based on whether 'AUD' is in that set (see the sketch below).
2. Clean your data set.
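A sketch of the first suggestion (it assumes the df, Currency, Username and Amount columns from the question; the results dictionary is just one way of collecting the per-currency output). Instead of hard-coding each currency, loop over whatever currencies are actually present:

import numpy as np
import pandas as pd

results = {}
for currency in df['Currency'].unique():
    cdf = df.loc[df['Currency'] == currency]
    table = pd.pivot_table(cdf, index=["Username"], values=["Amount"], aggfunc=np.sum)
    table.loc[currency + ' Amounts Rejected Grand Total'] = cdf['Amount'].sum()
    results[currency] = (table, cdf['Amount'].describe())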
answered Mar 22, 2021 at 6:29 by Sandeep_Rao
You can use try and except, where:
try:
    # your code here
except:
    # some print statement
    pass
answered Mar 22, 2021 at 6:17
Pandas is a powerful library for data analysis in Python. It offers a variety of functions to perform various operations on data. One of the most useful operations is the apply() function. It is used to apply a function to each row or column of a DataFrame. However, sometimes the data may contain errors or inconsistencies, which can cause the apply() function to fail. To handle such errors, we can use the try-except block within the apply() function.
Understanding the Try-Except Block
The try-except block is used to handle exceptions in Python. Exceptions are errors that occur during the execution of a program. When an exception occurs, the program stops executing and raises an error message. To prevent the program from crashing, we can use a try-except block to catch the exception and handle it gracefully.
The basic syntax of the try-except block is as follows:
try:
    # Code block that may raise an exception
except ExceptionType:
    # Code block to handle the exception
In the above syntax, we try to execute a code block that may raise an exception. If an exception occurs, the code block within the except block is executed to handle the exception.
Embedding Try-Except in Pandas Apply Operation
To embed the try-except block in Pandas apply() function, we can define a function that contains the try-except block and then pass this function to the apply() function. Here’s an example:
import pandas as pd

def my_function(x):
    try:
        result = x + 1          # code block that may raise an exception
    except:
        result = None           # code block to handle the exception
    return result

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df.apply(my_function)
In the above code, we define a function called my_function that contains the try-except block. We then pass this function to the apply() function of the DataFrame. The apply() function applies the function to each row or column of the DataFrame. If an exception occurs in any row or column, the try-except block within the function handles it gracefully.
Conclusion
In conclusion, the try-except block in Pandas apply() function is a powerful tool to handle exceptions that occur during data analysis. By embedding the try-except block within the apply() function, we can handle errors and inconsistencies in the data, and prevent the program from crashing. By following the guidelines in this beginner’s guide, you can easily embed the try-except block in your Pandas code and take full advantage of the power of the apply() function.